# Do students describe professors differently based on gender?

_Note: You can consult the solution of this live training in the file browser as `notebook-solution.ipynb`_

Language plays a crucial role in shaping our perceptions and attitudes towards gender in the workplace, in classrooms, and in personal relationships. Studies have shown that gender bias in language can have a significant impact on the way people are perceived and treated.

For example, research has found that job advertisements that use masculine-coded language tend to attract more male applicants, while those that use feminine-coded language tend to attract more female applicants. Similarly, gendered language can perpetuate differences in the classroom.

In this project, we'll use scraped student reviews from [ratemyprofessors.com](https://ratemyprofessors.com) to identify differences in language commonly used for male vs. female professors, and explore subtleties in how language in the classroom can be gendered.

This excellent [tool](https://benschmidt.org/profGender/#%7B%22database%22%3A%22RMP%22%2C%22plotType%22%3A%22pointchart%22%2C%22method%22%3A%22return_json%22%2C%22search_limits%22%3A%7B%22word%22%3A%5B%22aggressive%22%5D%2C%22department__id%22%3A%7B%22%24lte%22%3A25%7D%2C%22rHelpful%22%3A%5B1%2C2%5D%2C%22rClarity%22%3A%5B1%2C2%5D%7D%2C%22aesthetic%22%3A%7B%22x%22%3A%22WordsPerMillion%22%2C%22y%22%3A%22department%22%2C%22color%22%3A%22gender%22%7D%2C%22counttype%22%3A%5B%22WordsPerMillion%22%5D%2C%22groups%22%3A%5B%22department%22%2C%22gender%22%5D%2C%22testGroup%22%3A%22C%22%7D) created by Ben Schmidt allows us to enter the words and phrases that we find in our analysis and explore them in more depth. We'll do this at the end.

Catalyst also does some incredible work on [decoding](https://www.catalyst.org/2015/05/07/can-you-spot-the-gender-bias-in-this-job-description/) gendered language.

# 1. Scraping the web for reviews of professors

Text data, especially gendered text data, is hard to come by. Web scraping can be a helpful data collection tool when datasets are unavailable for this kind of work. We can write web scrapers to compile datasets on job descriptions, freelancer reviews, and, as in our use-case, professor reviews by students.

[ratemyprofessors.com](https://www.ratemyprofessors.com/professor?tid=589) provides a wonderful combination of qualitative and quantitative metrics that we can analyze.

Although the data on the website is not labeled by gender, we'll use the pronouns that students use to label professors "Male" or "Female". Of course, this approach is not perfect, as it relies on the _students'_ use of pronouns. Professors who use non-binary pronouns will also be under-represented in the data: very few reviews contain such pronouns, so it's not trivial to write an algorithm that detects them. These are important questions in the world of gender analysis though, so we encourage you to pick them up as extensions of this project!

### Task 1a. What relevant packages do we need for web scraping and reading in data?

In [1]:
# Used to open urls
from urllib.request import urlopen

# Used to parse html
from lxml import etree

# Used to pause code intermittently so that our scraper is not blocked
import time

# For data manipulation and analysis
import pandas as pd

# To access our data filenames so we can read them
import os

### Read the file 

Which professors will we be looking at?

The `web_scraping.ipynb` notebook in this workspace contains the Selenium code that was used to collect the urls from [ratemyprofessors.com](https://ratemyprofessors.com) that we'll be scraping in this notebook.

For now, we'll open the file `profs_1244.txt`, which lists one professor url per line, and save the urls in a list called `profs`.

In [2]:
with open(r'profs_1244.txt', 'r') as f:
    profs = [i.strip() for i in f.readlines()]

In [3]:
# verify
profs[:10]

['https://www.ratemyprofessors.com/professor?tid=398',
 'https://www.ratemyprofessors.com/professor?tid=589',
 'https://www.ratemyprofessors.com/professor?tid=600',
 'https://www.ratemyprofessors.com/professor?tid=608',
 'https://www.ratemyprofessors.com/professor?tid=627',
 'https://www.ratemyprofessors.com/professor?tid=670',
 'https://www.ratemyprofessors.com/professor?tid=869',
 'https://www.ratemyprofessors.com/professor?tid=934',
 'https://www.ratemyprofessors.com/professor?tid=978',
 'https://www.ratemyprofessors.com/professor?tid=1364']

###  How can we use urls to scrape relevant data about professors?

Each professor has an overall rating that looks like this
<img src="img/overall_rating_example.png"  width="400">

and a series of reviews that look like this
![Review example](img/review_example.png)

The code below can be used to iterate through all or part of the list of urls in `profs` and scrape the following qualitative and quantitative data for each professor:

- The overall rating for the professor
- All the individual reviews written by students about the professor
- The "emotion" corresponding to each individual review: `😎 AWESOME`, `😐 AVERAGE`, or `😖 AWFUL`
- A numerical "quality" rating corresponding to each individual review

**You won't need to run through this whole list though, because the `data/` folder already contains the reviews of several professors that we have scraped for you!**

We won't be using the "difficulty" ratings shown here.

In [5]:
# USE ONLY ONE OF THE FOLLOWING FOR STATEMENTS

# 1. Sample code to loop through the whole list of professors    
# for s in range(0, len(profs), 10):

# 2. Sample code to loop through the first 10 professors
for s in range(0, 10, 10):

    texts = [] # Initialize an empty list
    print((s, s+10)) # Print the range of professors covered by this block
    
    for url in profs[s:s+10]: # Iterate through this block
        time.sleep(8) # To prevent sending too many requests at once
        r = urlopen(url) # Open URL
        htmlparser = etree.HTMLParser() # Instantiate a parser to parse HTML
        tree = etree.parse(r, htmlparser) # Parse HTML returned by the url
        
        text = tree.xpath('//*[@id="ratingsList"]/li[*]/div/div/div[3]/div[3]/text()') # Extract reviews
        ratings = tree.xpath('//*[@id="root"]/div/div/div[3]/div[2]/div[1]/div[1]/div[1]/div/div[1]') # Extract ratings
        emotion = tree.xpath('//*[@id="ratingsList"]/li[*]/div/div/div[1]/div[1]/div[2]/text()') # Extract emotion
        quality = tree.xpath('//*[@id="ratingsList"]/li[*]/div/div/div[2]/div[1]/div/div[2]') # Extract quality
        texts.append((url,
                      text,
                      [i.text for i in ratings][0],
                      emotion,
                      [i.text for i in quality],
                     )) # Append metrics to empty list

    print() # Print new line for readability
    df = pd.DataFrame(texts, columns=['prof', 'review', 
                                     'rating', 'emotion', 'quality']) # Same column names as the pre-scraped files in data/
    df.to_csv(f'df_{s}_to_{s+10}.csv') # Write this block of 10 professors to its own CSV file
    time.sleep(10) # Pause to prevent sending too many requests at once

(0, 10)
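
To see how the XPath expressions in the loop pull text out of a parsed page, here is a minimal offline sketch. The HTML snippet and its layout are invented for illustration and do not mirror the real ratemyprofessors.com markup; only the `lxml` calls are the same as in the cell above.

```python
import io

from lxml import etree

# Hypothetical HTML standing in for a downloaded page (not the site's real markup)
html = """
<html><body>
  <ul id="ratingsList">
    <li><div>awesome</div><div>Clear lectures.</div></li>
    <li><div>awful</div><div>Tests were too hard.</div></li>
  </ul>
</body></html>
"""

parser = etree.HTMLParser()  # Same parser type as in the scraping loop
tree = etree.parse(io.StringIO(html), parser)  # Parse from a file-like object

# li[*] matches every list item that has at least one child element;
# ending the path in /text() returns the text nodes directly
emotions = tree.xpath('//*[@id="ratingsList"]/li[*]/div[1]/text()')
comments = tree.xpath('//*[@id="ratingsList"]/li[*]/div[2]/text()')

# Without /text(), xpath() returns elements, and the text is read via .text
comment_elements = tree.xpath('//*[@id="ratingsList"]/li[*]/div[2]')
comments_from_elements = [element.text for element in comment_elements]

print(list(zip(emotions, comments)))
```

This is why the loop reads `text` and `emotion` directly but uses `[i.text for i in ...]` for `ratings` and `quality`: the first two paths end in `/text()`, the last two return elements.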



# 2. Reading pre-scraped data

### Task 2a. How can we read a directory of scraped professor reviews and concatenate them?

Since we have already scraped reviews from several professors for you, let's begin by concatenating all the files in the provided `data` folder.

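Since `review`, `emotion` and `quality` are lists but were recorded in string form, we'll parse them to turn them back from a string into a list. We use `ast.literal_eval()` rather than `eval()` for this, because it only accepts Python literals and so cannot execute code hidden in scraped text.

In [7]:
import ast # To safely parse the string form of a list

prof_review = pd.concat([pd.read_csv('data/'+i, index_col=0) for i in os.listdir('data')]).reset_index(drop=True)
prof_review['review'] = prof_review['review'].apply(ast.literal_eval)
prof_review['emotion'] = prof_review['emotion'].apply(ast.literal_eval)
prof_review['quality'] = prof_review['quality'].apply(ast.literal_eval)
df = prof_review # Later cells refer to this DataFrame as `df`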
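
Why this conversion is needed can be seen in a small round trip. The sketch below uses made-up values and only the standard library: a list written to CSV comes back as a string, and `ast.literal_eval()`, which unlike `eval()` only accepts Python literals, turns it back into a list.

```python
import ast
import csv
import io

# Made-up row in the same spirit as the scraped files: a key and a list of ratings
row = {"prof": "prof_a", "quality": [4.0, 2.5, 3.0]}

# Write the row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["prof", "quality"])
writer.writeheader()
writer.writerow(row)

# Read it back: every field is now a plain string
buffer.seek(0)
restored = next(csv.DictReader(buffer))
raw_quality = restored["quality"]

# Parse the string back into a list of floats
parsed_quality = ast.literal_eval(raw_quality)

print(type(raw_quality).__name__, type(parsed_quality).__name__)
```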

### Task 2b. What does the final shape of our DataFrame look like?

Browse the DataFrame below to familiarize yourself with the dataset we'll be working with. It contains one row for each professor, with:
- Their url
- All the raw text reviews for that professor
- Their overall rating
- All the emotion labels associated with reviews of that professor
- All quality ratings assigned to that professor

In [8]:
prof_review.head()

| | prof | review | rating | emotion | quality |
|---|---|---|---|---|---|
| 0 | https://www.ratemyprofessors.com/professor?tid... | [I liked his class. Sure he can be a little bo... | 4.0 | [awesome, awful, average, awesome, awesome, aw... | [4.0, 2.5, 3.0, 5.0, 5.0, 4.5] |
| 1 | https://www.ratemyprofessors.com/professor?tid... | [Dr. Gao is very knowledgeable. His enthusias... | 2.8 | [average, awesome, awful, awful, average, awful] | [3.5, 4.0, 1.0, 2.5, 3.5, 2.5] |
| 2 | https://www.ratemyprofessors.com/professor?tid... | [Tests are way too hard for a course that shou... | 3.6 | [awful, average, awful, awful, average, awful,... | [1.0, 3.0, 2.5, 2.5, 3.5, 1.5, 3.5, 5.0, 3.5, ... |
| 3 | https://www.ratemyprofessors.com/professor?tid... | [He is one of the best professor when it comes... | 3.8 | [awesome, awesome, awesome, awesome, awesome, ... | [5.0, 4.0, 5.0, 5.0, 5.0, 4.0, 3.0, 1.0, 1.0, ... |
| 4 | https://www.ratemyprofessors.com/professor?tid... | [She is the worst, and she makes you not want ... | 3.3 | [awful, awful, average, awful, average, awful,... | [1.0, 2.5, 3.0, 2.5, 3.5, 1.0, 3.0, 5.0, 2.0, ... |


# 3. Text Analysis

## Additional package imports are required for data visualization and NLP

In [10]:
import numpy as np # For manipulating matrices during NLP

import nltk # Natural language toolkit
from nltk.tokenize import word_tokenize # Used for breaking up strings of text (e.g. sentences) into words
from nltk.stem.porter import PorterStemmer # Used to reduce a word to its stem
from nltk.tokenize import WhitespaceTokenizer # Used for breaking up strings of text (e.g. sentences) into words based on white space
#nltk.download('punkt')

# Used to count the occurrences of words and phrases and to compute TF-IDF scores
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text # Used to access the built-in list of English stop words

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')

### 3b. How can we assign gender labels to professors?

Let's write a custom function that assigns a gender label to professors based on the pronouns most commonly used for them. Specifically:
- If any of `['she', 'her', 'herself', 'shes']` occur more than 5 times across all reviews for that professor, we label the professor "F".
- If any of `['him', 'he', 'his', 'himself']` occur more than 5 times across all reviews for that professor, we label the professor "M".

In [13]:
def assign_pronoun(review_list):
    ____

In [26]:
df['pronouns'] = ____
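
If you get stuck, here is one possible sketch of the labeling rule, using only the standard library. It rests on assumptions that the task does not spell out: the counts of all pronouns in a group are added together, apostrophes are dropped so that "she's" counts as "shes", and when both groups pass the threshold the larger count wins (ties go to "F"). The name `assign_pronoun_sketch` and the `threshold` parameter are hypothetical, and the solution notebook may do this differently.

```python
import re
from collections import Counter

FEMALE_PRONOUNS = {"she", "her", "herself", "shes"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}


def assign_pronoun_sketch(review_list, threshold=5):
    """Return "F", "M", or None based on pronoun counts across all reviews."""
    # Join the reviews, lowercase them, and drop apostrophes so "she's" -> "shes"
    joined = " ".join(review_list).lower().replace("'", "")
    counts = Counter(re.findall(r"[a-z]+", joined))

    female = sum(counts[word] for word in FEMALE_PRONOUNS)
    male = sum(counts[word] for word in MALE_PRONOUNS)

    # Assumption: if both groups pass the threshold, the larger count wins
    if female > threshold and female >= male:
        return "F"
    if male > threshold:
        return "M"
    return None  # Too few pronouns to assign a label


sample_reviews = [
    "She is great. I loved her class.",
    "She explains well and her notes help.",
    "She cares, and she's fair.",
]
sample_label = assign_pronoun_sketch(sample_reviews)
print(sample_label)
```

A function like this could be applied to each professor's list of reviews with `.apply()` on the `review` column.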

### 3c. Are there any initial differences between male and female professors based on their overall ratings?

Let's start with a barplot.

In [10]:
plt.figure(figsize=(____))
____
plt.show()

A boxplot overlaid with a stripplot will give us a better sense of the distribution of the data.

In [14]:
plt.figure(figsize=(5,5))
____
____
plt.show()

## Task 3d. What are the most important words being used to describe professors in reviews?

Let's write a custom function that **tokenizes** and **stems** our list of words.
- **Word tokenization**: process of splitting text into individual words, called tokens. A common preprocessing step in natural language processing (NLP) so that text can be analyzed and processed more easily. Methods include whitespace tokenization, regular expression-based tokenization, and rule-based tokenization. We'll be using the `WhitespaceTokenizer` from `nltk`, which splits text on whitespace.
- **Stemming**: process of reducing words to a base form, called the stem, by stripping common suffixes. Also a common pre-processing step in NLP, so that words with a common base form are treated the same way. For example, the stem of "running" is "run", and of "reviews" is "review". Stemming is cruder than lemmatization, which maps words to their dictionary form (the lemma of "am" is "be", and of "mice" is "mouse"), so a stem is not always a dictionary word. We'll be using the `PorterStemmer` from `nltk`.

In [67]:
def tokenize(text):
    tk = WhitespaceTokenizer() # Splits the text on whitespace
    stemmer = PorterStemmer() # Create the stemmer once and reuse it for every token
    return [stemmer.stem(token) for token in tk.tokenize(text)]
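
To get a feel for what these two steps do, here is a small sketch on made-up strings. It uses the same two `nltk` classes as the function above; neither needs any downloaded `nltk` data.

```python
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer

tk = WhitespaceTokenizer()
stemmer = PorterStemmer()

# Tokenize on whitespace, then stem each token (the stemmer also lowercases)
tokens = tk.tokenize("Running reviews")
stems = [stemmer.stem(token) for token in tokens]

# Whitespace tokenization leaves punctuation attached to the neighbouring word
punctuated = tk.tokenize("Great class.")

print(tokens, stems, punctuated)
```

Because punctuation stays attached, "class." and "class" end up as different tokens, which is worth keeping in mind when reading the results later on.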

Let's import a list of stop words, which are common English words that we will be ignoring in our analysis. `sklearn` provides a common list of stop words, and we can append additional words to this list. Below, we append pronouns, along with the words "class" and "student". Feel free to add any additional words you'd like to ignore to this list later on as you try to build upon this analysis!

In [68]:
my_stop_words = text.ENGLISH_STOP_WORDS.union(["he","she","his","her",
                                              "himself","herself", "hers","shes",
                                              "class","student"])

For the purpose of analyzing review texts, we want to move from having one row for each professor to one row for each review. Let's do this with `.explode()` from pandas.

In [69]:
df_quality = df[(df['review'].apply(len) == df['quality'].apply(len))]
q = df_quality[['pronouns','review','quality']].explode(['review','quality'], ignore_index=True).dropna()
q['quality'] = q['quality'].astype(float)
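
The reshaping step can be illustrated on a toy DataFrame. The values below are made up; the column names follow the ones used in this notebook. Exploding several columns at once requires the lists in each row to have the same length, which is why the cell above first keeps only the professors whose `review` and `quality` lists match in length.

```python
import pandas as pd

# Toy data: one row per professor, with parallel lists of reviews and ratings
toy = pd.DataFrame({
    "pronouns": ["M", "F"],
    "review": [["Clear lectures.", "Too fast."], ["Very helpful."]],
    "quality": [[4.5, 2.0], [5.0]],
})

# Exploding both columns together keeps each review next to its own rating
toy_long = toy.explode(["review", "quality"], ignore_index=True)
toy_long["quality"] = toy_long["quality"].astype(float)

print(toy_long)
```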

TF-IDF vectorization is the process of assigning a score to each word in each review, based on how frequently the word occurs in that review, scaled down by how many reviews in the dataset overall contain the word.

We'll use `TfidfVectorizer()` to generate these scores. This will return a matrix, with as many rows as reviews, and as many columns as distinct words and phrases in our dataset.

In [78]:
vec = ____
X = ____
feature_names = ____
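
To make the scoring concrete, the sketch below computes TF-IDF by hand for three made-up reviews, using only the standard library. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each row to unit length, which, to my understanding, is the default documented for scikit-learn's `TfidfVectorizer`. It illustrates the idea and is not a replacement for the vectorizer.

```python
import math
from collections import Counter

reviews = [
    "great lectures great notes",
    "hard tests",
    "great tests",
]
docs = [review.split() for review in reviews]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: the number of reviews that contain each word
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed inverse document frequency (assumed variant, see the text above)
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}


def tfidf_row(doc):
    """Term count times idf for every vocabulary word, scaled to unit length."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


matrix = [tfidf_row(doc) for doc in docs]  # One row per review, one column per word
scores = dict(zip(vocab, matrix[1]))       # Scores for the review "hard tests"

print(scores)
```

In the review "hard tests", the word "hard" gets a higher score than "tests", because "tests" also appears in another review and is therefore less distinctive.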

`X` is a sparse matrix. We'll now move into filtering X for:
- Rows with male professors and reviews of high quality 
- Rows with female professors and reviews of high quality 
- Rows with male professors and reviews of low quality 
- Rows with female professors and reviews of low quality 

We can explore feature importance in each of these to get a sense of which words and phrases are coming up most often in the data.

In [79]:
m_pos = ____
f_pos = ____
m_neg = ____
f_neg = ____

Let's have a look at what language students are using to describe male professors positively. The code below will return the 300 most important ngrams.

In [11]:
importance = ____
tfidf_feature_names = ____
tfidf_feature_names[importance[:300]]

Let's have a look at what language students are using to describe female professors positively.

In [None]:
importance = ____
tfidf_feature_names = ____
tfidf_feature_names[importance[:300]]

Let's have a look at what language students are using to describe male professors negatively.

In [12]:
importance = ____
tfidf_feature_names = ____
tfidf_feature_names[importance[:300]]

Let's have a look at what language students are using to describe female professors negatively.

In [13]:
importance = ____
tfidf_feature_names = ____
tfidf_feature_names[importance[:300]]

## Congratulations on making it to the end! 
### Where to from here?
- We can feed these words into Ben Schmidt's [tool](https://benschmidt.org/profGender/#%7B%22database%22%3A%22RMP%22%2C%22plotType%22%3A%22pointchart%22%2C%22method%22%3A%22return_json%22%2C%22search_limits%22%3A%7B%22word%22%3A%5B%22his%20kids%22%2C%22her%20kids%22%5D%2C%22department__id%22%3A%7B%22%24lte%22%3A25%7D%7D%2C%22aesthetic%22%3A%7B%22x%22%3A%22WordsPerMillion%22%2C%22y%22%3A%22department%22%2C%22color%22%3A%22gender%22%7D%2C%22counttype%22%3A%5B%22WordCount%22%2C%22TotalWords%22%5D%2C%22groups%22%3A%5B%22unigram%22%5D%2C%22testGroup%22%3A%22C%22%7D) to derive insights by field.
- If you're interested in learning more about [web scraping](https://app.datacamp.com/learn/courses/web-scraping-with-python), take our courses on Web Scraping in Python.
- If you're interested in diving into the world of Natural Language Processing, explore our [skill track](https://app.datacamp.com/learn/skill-tracks/natural-language-processing-in-python).