In [115]:
import pandas as pd
import numpy as np
import seaborn as sns
import time
import re
import pkg_resources

!pip install num2words
!pip install alt-profanity-check # much faster and more accurate than better_profanity
!pip install symspellpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [116]:
import num2words
from profanity_check import predict, predict_prob
from symspellpy import SymSpell, Verbosity

In [117]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [118]:
df = pd.read_csv('/content/drive/MyDrive/RMP/scraped_comments.csv')

In [119]:
pd.options.display.max_colwidth = 100

In [120]:
df.head(25)

Unnamed: 0,comment_id,firstName,lastName,prof_class,comment,ratingTags,date,attendanceMandatory,grade,clarityRating,difficultyRating,helpfulRating,textbookUse,thumbsDownTotal,thumbsUpTotal,wouldTakeAgain
0,UmF0aW5nLTIyODAzODMy,Marty,Beans,FINITEMATH,Very nice and understanding. A lot of homework but she only grades two problems from it and thei...,,2014-01-29 16:17:28 +0000 UTC,Y,B,4,3,4,5.0,0,0,
1,UmF0aW5nLTE5NjU2ODY4,Marty,Beans,MTH106,"She is very helpful. Gives EC if you go to tutoring. Poor office hours, but willing to help afte...",,2012-01-03 02:19:53 +0000 UTC,,,4,2,4,5.0,0,0,
2,UmF0aW5nLTEyMTIwMDcz,Marty,Beans,MATHSUMFIN,she was nice. good job her,,2006-07-31 20:31:38 +0000 UTC,,,4,4,5,,0,0,
3,UmF0aW5nLTExMTc4MTMx,Marty,Beans,DEVELOPMATH,Professor Beans is one of the best Math teachers you will have. This is coming from someone who ...,,2005-12-15 23:07:38 +0000 UTC,,,5,1,5,,0,0,
4,UmF0aW5nLTEwMTkzMTcx,Marty,Beans,MATH100,"Big smile. Big goals. Nice lady. Bright and cheery. Sweet. Helpful. Loving. Addicted to Math, ...",,2005-08-31 21:17:16 +0000 UTC,N,A-,5,1,5,,0,0,
5,UmF0aW5nLTE5NjU5Nzk1,Michael,Miller,MUS1600,Dr Miller consistently made mistakes while teaching us and our TA frequently corrected him. We w...,,2012-01-03 15:08:57 +0000 UTC,,,3,3,2,5.0,0,0,
6,UmF0aW5nLTE4NTY4MDc3,Michael,Miller,MUS2000,I am so thankful to have had the opportunity to have studied with Dr. Miller. He is an excellent...,,2011-05-10 12:48:26 +0000 UTC,,,3,4,4,1.0,0,0,
7,UmF0aW5nLTE4MTE2MjAz,Michael,Miller,MUS1600,This guy was the worst professor I had that semester. He didn't make any sense at all on his bes...,,2011-01-10 21:35:12 +0000 UTC,,,1,5,1,5.0,0,0,
8,UmF0aW5nLTE3MjU2NDIw,Michael,Miller,MUS2000,Dr. Miller is a fantastic oboist and teacher. He really knows how to connect with his students a...,,2010-05-23 17:17:05 +0000 UTC,,,4,2,5,1.0,0,0,
9,UmF0aW5nLTE2OTc0MzEz,Michael,Miller,MUS2610,"Very nice man, but an awful teacher. It makes you feel sorry for him, and sorry that you ever to...",,2010-04-09 23:58:39 +0000 UTC,,,1,3,1,3.0,0,0,


In [121]:
# Keep only the relevant columns; .copy() makes df an independent DataFrame rather than a slice of the original
df = df[['comment_id', 'firstName', 'lastName', 'prof_class', 'comment', 'clarityRating', 'helpfulRating']].copy()

# Questions
- Word counts per comment and the distributions
- **For a professor, is the avg rating across their different classes similar?**
- **Important features for rating across departments?**
- What is the distribution of positive, negative, and mixed ratings?
- For comments with swear words, what is the rating distribution?
- Would generating meta features help with the sentiment analysis?

Possible avenues include troll detection, topic modeling, and profanity detection.

# Downcasting Ratings
Downcasting the rating columns from int64 to int8 reduces memory use and can potentially speed up pandas methods.

In [122]:
print(df['clarityRating'].dtype)
df['clarityRating'] = pd.to_numeric(df['clarityRating'], downcast='integer')
df['helpfulRating'] = pd.to_numeric(df['helpfulRating'], downcast='integer')
print(df['clarityRating'].dtype)

int64
int8


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
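
To check whether the downcast is worth it, the memory footprint before and after can be compared. The following is a minimal sketch on a synthetic column: the `ratings` series is a hypothetical stand-in for a rating column, not the RMP data, and the saving on the real DataFrame depends on its other columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a rating column: 100,000 integers between 1 and 5
rng = np.random.default_rng(0)
ratings = pd.Series(rng.integers(1, 6, size=100_000), dtype="int64")

# downcast='integer' picks the smallest integer dtype that can hold every value
downcast = pd.to_numeric(ratings, downcast="integer")

before = ratings.memory_usage(index=False)   # bytes used by the int64 values
after = downcast.memory_usage(index=False)   # bytes used by the downcast values

print(ratings.dtype, before)
print(downcast.dtype, after)
```

Since every value fits in one byte, the downcast column takes one eighth of the original memory.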
  


# Duplicates
Duplicates can be handled reliably because RMP provides a unique comment_id for every comment.

There weren't many duplicates anyway (51 out of 4,121,664 rows), but it's still helpful to remove them.

In [123]:
print(df.shape)
df['comment_id'].value_counts() # Mistakes with scraping that led to duplicates

(4121664, 7)


UmF0aW5nLTMxOTczNDg=    2
UmF0aW5nLTExNTY0NzQ4    2
UmF0aW5nLTMxNjU1MTc0    2
UmF0aW5nLTEwOTYyODIz    2
UmF0aW5nLTM3OTkwNzk=    2
                       ..
UmF0aW5nLTIzMjgwMjQx    1
UmF0aW5nLTEyMDYwMzQ4    1
UmF0aW5nLTM2OTY4NTA=    1
UmF0aW5nLTI2MjI2MzYw    1
UmF0aW5nLTQxNTc5Ng==    1
Name: comment_id, Length: 4121613, dtype: int64

In [124]:
# Dropping duplicates
print("Size before dropping duplicates:", df.shape)
df.drop_duplicates(subset=['comment_id'], inplace=True)
print("Size after dropping duplicates:", df.shape)
df['comment_id'].value_counts()

Size before dropping duplicates: (4121664, 7)
Size after dropping duplicates: (4121613, 7)


UmF0aW5nLTIyODAzODMy    1
UmF0aW5nLTIwNzgxOTI4    1
UmF0aW5nLTIzNTYzMDky    1
UmF0aW5nLTIzNDk5MzY1    1
UmF0aW5nLTIzMzYxNTA3    1
                       ..
UmF0aW5nLTMwMzQ5NzU=    1
UmF0aW5nLTE2MDUwNzM1    1
UmF0aW5nLTE1NDQ5ODM2    1
UmF0aW5nLTE3NzcwODQ=    1
UmF0aW5nLTQxNTc5Ng==    1
Name: comment_id, Length: 4121613, dtype: int64

# Clarity Ratings
Clarity is one of the metrics students use to rate a professor. In the past, students could give floating-point ratings, but only integers were gathered when scraping.

In [125]:
print("Number of null ratings:", df['clarityRating'].isna().sum()) # No missing ratings
df['clarityRating'].value_counts()

Number of null ratings: 0


 5    1686243
 4     835475
 1     639677
 3     508522
 2     451695
-1          1
Name: clarityRating, dtype: int64

In [126]:
df = df[df['clarityRating'] != -1]
df['clarityRating'].value_counts()

5    1686243
4     835475
1     639677
3     508522
2     451695
Name: clarityRating, dtype: int64

# Helpful Ratings

In [127]:
print("Number of null ratings:", df['helpfulRating'].isna().sum())
df['helpfulRating'].value_counts()

Number of null ratings: 0


 5    1877128
 4     701022
 1     689579
 3     459514
 2     394368
-1          1
Name: helpfulRating, dtype: int64

In [128]:
df = df[df['helpfulRating'] != -1]
df['helpfulRating'].value_counts()

5    1877128
4     701022
1     689579
3     459514
2     394368
Name: helpfulRating, dtype: int64

# Null values
Not only can a comment be empty, but RMP sometimes labels empty comments with "No Comments". We consider both to be null values and drop them from the dataset.

In [129]:
print("Rows with empty comments:", df['comment'].isna().sum()) # Empty comments exist
print("Rows named \'No Comments\':", (df['comment'] == 'No Comments').sum())

Rows with empty comments: 7310
Rows named 'No Comments': 184430


In [130]:
df.dropna(subset=['comment'], inplace=True)
df = df[df['comment'] != 'No Comments']

In [131]:
print("Rows with empty comments:", df['comment'].isna().sum())
print("Rows named \'No Comments\':", (df['comment'] == 'No Comments').sum())

Rows with empty comments: 0
Rows named 'No Comments': 0


In [132]:
df.reset_index(drop=True, inplace=True)

# Contractions
Idea: aggregate a large list of known contractions, then build a spell checker and run the contractions captured from the RMP comments against it, so that misspelled contractions can be mapped to a known form.

**Use fuzzy string search**
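
As a rough illustration of the fuzzy-matching idea, here is a minimal sketch that uses the standard library's `difflib` as a stand-in for a real spell checker. The `KNOWN_CONTRACTIONS` list, the `closest_contraction` helper, and the 0.8 cutoff are assumptions made for this example; `difflib` is far too slow for millions of comments, so this only shows the concept.

```python
from difflib import get_close_matches

# Hypothetical, tiny subset of known contractions (a real list would be much larger)
KNOWN_CONTRACTIONS = ["doesn't", "don't", "didn't", "aren't", "can't", "you're"]

def closest_contraction(word, known=KNOWN_CONTRACTIONS, cutoff=0.8):
    """Return the most similar known contraction, or None if nothing is close enough."""
    matches = get_close_matches(word.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for candidate in ["dosen't", "arn't", "professor's"]:
    print(candidate, "->", closest_contraction(candidate))
```

Misspellings such as "dosen't" resolve to a known contraction, while possessives such as "professor's" fall below the cutoff and are left alone.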

In [133]:
comments_with_diff_quote = df[df['comment'].str.contains('&quot;|&apos;|&#39;|&lsquo;|&rsquo;')]
comments_with_diff_quote.shape

(116129, 7)

In [134]:
for comment in comments_with_diff_quote['comment'].head(5):
  print(comment)

Although this woman truly is hot, don't let her looks fool you; she gives evil tests! I spent the past 4 days studying for her test and I will be gald to get a &quot;C&quot; on her test.
THIS PROFESSOR IS GREAT HIGHLY RECOMMEND TAKING HER CLASS,GOTTA SAY SHE IS &quot;HOT&quot;, IF YOU ARE LIKE ME YOU'LL END UP DAYDREAMING IN CLASS. TRIES HER BEST TO ASSIST IN ANY WAY SHE CAN AND IS ALWAYS WILLING TO GO THE EXTRA MILE.
Professor Zimmerman is a wonderful teacher.  She is inspiring and encourages her class to think &quot;outside the box.&quot; She is a powerhouse. Full of energy and ideas, she strives to have her students reach their full potential as photographic artists.
If you are an emotionally fragile young girl, her brand of &quot;personal strength&quot; (aka mean-spirited bitterness) will appeal to you.  If you're like me, however, and have never felt held down by white men or had any other Lifetime movie experiences and are generally well-adjusted, avoid her.  Either way, you won'

In [21]:
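# Normalize HTML apostrophe entities -> single quote
# (assign the result back; replace(inplace=True) on a selected column may act on a copy)
df['comment'] = df['comment'].replace('&apos;|&#39;', "'", regex=True)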

In [22]:
# Collect all the comments containing a contraction
comments_with_contract = df[df['comment'].str.contains('[a-zA-Z]+\'[a-zA-Z]+')]
comments_with_contract.shape

(1681823, 7)

In [23]:
contract_words = df['comment'].str.extractall('(?P<word>[a-zA-Z]+\'[a-zA-Z]+)') # Should maybe cut out names
contract_words['word_lower'] = contract_words['word'].str.lower()


In [24]:
unq_contract_words = contract_words['word_lower'].unique()
print("Unique contractions:", unq_contract_words.shape)

Unique contractions: (26226,)


In [63]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have", "that'll": 'that will', "c'mon": "come on", "gov't": "government", "ta's": "teaching assistants", "there're": "there are", "who'd": "who had", "y'all": "you all", "g'luck": "good luck", "u're": "you are", "u'll": "you will", "hw's": "homeworks", "q's": "questions", "this'll": "this will", "ppt's": "powerpoints", "req'd": "required", "u'd": "you would", "there'll": "there will", "u've": "you have", "lec's": "lectures", "mc's": "multiple choices", "tf's": "true falses", "bs'd": "bullshitted", "bs'ing": "bullshitting", "eq's": "equations", "req's": "requirements", "what'd": "what would", "bs's": "bullshits"}

print(len(contraction_mapping))
contract_value_counts = contract_words['word_lower'].value_counts()
top_100_contractions = contract_value_counts.head(100).index

longest_contractions = sorted(top_100_contractions, key=lambda x: len(x), reverse=True)
print(longest_contractions[:10])

146
["professor's", "everyone's", "children's", "semester's", "shouldn't", "student's", "teacher's", "should've", "someone's", "wouldn't"]


In [26]:
# In the dictionary text file I looked at, the contractions have a frequency of 300000; count_threshold=299999 filters out entries with a lower frequency
# The longest valid contractions observed above are about 9 characters, so a prefix_length of 8 with an edit distance of 2 should cover them
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=8, count_threshold=299999)
dictionary_path = pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True

In [27]:
test_term = "cann't"
suggestions = sym_spell.lookup(test_term, Verbosity.CLOSEST,
                               max_edit_distance=2)
for suggestion in suggestions:
  print(suggestion.term)

can't


In [47]:

for contraction in top_100_contractions:
  if contraction not in contraction_mapping:
    suggestions = sym_spell.lookup(contraction, Verbosity.CLOSEST, max_edit_distance=2, include_unknown=True)
    for suggestion in suggestions:
      if suggestion.count == 300000:
        print(contraction, contract_value_counts[contraction], suggestion.term)
        break


dosen't 994 doesn't
lot's 914 let's
get's 630 let's
someone's 490 someone's
dosn't 485 doesn't
doens't 462 doesn't
arn't 421 aren't


In [82]:
top_1000_contractions = contract_value_counts.head(1000).index
for contraction in top_1000_contractions:
  if contraction not in contraction_mapping:
    suggestions = sym_spell.lookup(contraction, Verbosity.CLOSEST, max_edit_distance=1, include_unknown=True)
    for suggestion in suggestions:
      if suggestion.count == 300000:
        print(contraction, contract_value_counts[contraction], suggestion.term)
        break

dosen't 994 doesn't
lot's 914 let's
get's 630 let's
someone's 490 someone's
dosn't 485 doesn't
doens't 462 doesn't
arn't 421 aren't
her's 368 here's
does't 352 doesn't
doen't 342 doesn't
id's 314 it's
does'nt 269 doesn't
see's 249 she's
did't 208 didn't
i'v 207 i've
did'nt 189 didn't
your'e 158 you're
havn't 149 haven't
you'l 142 you'll
lee's 140 let's
t's 131 it's
i's 131 it's
iv'e 127 i've
you'r 121 you're
your're 119 you're
would'nt 111 wouldn't
something's 110 something's
wan't 109 wasn't
don's 105 don't
ge's 104 he's
din't 88 didn't
dont't 87 don't
e's 73 he's
would't 72 wouldn't
wern't 71 weren't
could'nt 61 couldn't
in's 60 it's
doesen't 55 doesn't
h's 54 he's
could't 54 couldn't
should'nt 53 shouldn't
shes's 50 she's
is'nt 50 isn't
cann't 46 can't
woudn't 43 wouldn't
dn't 41 don't
do't 41 don't
doensn't 40 doesn't
sue's 39 she's
somebody's 38 somebody's
couln't 37 couldn't
he'a 36 he'd
was't 35 wasn't
should't 35 shouldn't
do'nt 35 don't
wouln't 35 wouldn't
wasen't 34 wasn't
we

In [114]:
# WARNING: I have mixed feelings about this step; spell-correcting contractions is probably quite error-prone
cnt = 0
def contraction_replacer(text, mapping):
  global cnt
  cnt += 1
  if cnt % 100000 == 0:
    print(cnt) # progress indicator
  tokens = text.split(' ')
  for i, token in enumerate(tokens):
    if re.match(r"[a-zA-Z]+'[a-zA-Z]+", token):
      contraction = token.lower()
      if contraction not in mapping:
        # Unknown contraction: try to spell-correct it to a dictionary contraction (frequency 300000)
        suggestions = sym_spell.lookup(contraction, Verbosity.CLOSEST, max_edit_distance=1, include_unknown=True)
        for suggestion in suggestions:
          if suggestion.count == 300000:
            contraction = suggestion.term
            break
      tokens[i] = mapping.get(contraction, token)
  return ' '.join(tokens)

# Index.intersection is used because `&` no longer performs set intersection on indexes in newer pandas versions
inds = df.index.intersection(comments_with_contract.index)
df.loc[inds, 'comment'] = df.loc[inds, 'comment'].apply(lambda c: contraction_replacer(c, contraction_mapping))



100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
1300000
1400000
1500000
1600000


In [113]:
# df.loc[[3929867]].apply(lambda x: contraction_replacer(x['comment'], contraction_mapping), axis=1)
# df.loc[[3929867]].apply(lambda x: contraction_replacer(x['comment'], contraction_mapping))
df.loc[3929867, 'comment']

'I hate going to her classes, she does not teach , she makes you do the work in groups without clearly explaining what she wants to be done'
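
Splitting on spaces means a token like `don't,` (with trailing punctuation) is never found in the mapping. One possible alternative, sketched here under the assumption that only exact dictionary hits should be expanded, is to substitute directly on the regex match. `SAMPLE_MAPPING` and `expand_contractions` are hypothetical names for this example and the spell-checking step is left out.

```python
import re

# Hypothetical, tiny subset of the contraction mapping
SAMPLE_MAPPING = {"don't": "do not", "doesn't": "does not", "it's": "it is"}
CONTRACTION_RE = re.compile(r"[a-zA-Z]+'[a-zA-Z]+")

def expand_contractions(text, mapping):
    """Expand every contraction found in the mapping; leave everything else untouched."""
    def _expand(match):
        word = match.group(0)
        # Look up the lowercase form; fall back to the original word if it is unknown
        return mapping.get(word.lower(), word)
    return CONTRACTION_RE.sub(_expand, text)

print(expand_contractions("She doesn't teach, and it's boring. Don't!", SAMPLE_MAPPING))
# She does not teach, and it is boring. do not!
```

Because only the matched span is replaced, surrounding punctuation is preserved. As in the cell above, the expansion comes out lowercase.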

# Shortened Words / Slang
Hard to find; many are domain-specific.

In [None]:
# TA - Teaching Assistant
# QA or Q/A - Question Answer
# Prof - Professor
# MCQ - Multiple Choice Questions
# FRQ - Free Response Questions
# ques. - questions
# a's, b's, c's, d's, e's, f's, ... - grade_{}_plural
# forms of u, u'll, u've - you
# TF's or TF - True False Question
# ppt - Powerpoint
# ppl - people

# Symbol Sequences
Collapse runs of certain repeated symbols down to a single character.
If a symbol legitimately needs to be repeated, could a spell checker restore it?


In [24]:
df['comment'] = df['comment'].replace(r'([!\.,\-+=\/\\])\1+', r'\1', regex=True) # r'\1' is the backreference; a plain '\1' is the control character \x01
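
Collapsing repeated symbols relies on a regex backreference, and in Python the replacement has to be a raw string: without the `r` prefix, `'\1'` is the control character `\x01`. A minimal sketch with the standard `re` module (the helper name and the sample sentence are made up for illustration):

```python
import re

# Same character class as in the cell above: a symbol followed by one or more repeats of itself
REPEATED_SYMBOL_RE = re.compile(r'([!.,\-+=/\\])\1+')

def collapse_repeated_symbols(text):
    """Replace a run of the same symbol with a single occurrence."""
    return REPEATED_SYMBOL_RE.sub(r'\1', text)

print(collapse_repeated_symbols("Great class!!! But the tests... ugh--so hard"))
# Great class! But the tests. ugh-so hard

# The non-raw string is a single control character, not a backreference
print('\1' == '\x01')
# True
```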

# Swear words
We might want to count how often these appear, or compute their word counts across the entire population.

It might also be interesting to see how these words relate to the rating of the review. In addition, we should check whether swear-word censorship is the result of admin reviews.

In [21]:
test = 'slut'
test1 = 'sl*t'
test2 = 's**t'
print(predict_prob([test]))
print(predict_prob([test1]))
print(predict_prob([test2]))

[0.79024092]
[0.04577509]
[0.04577509]


In [55]:
def detect_profanity(comment):
  # Adding an extra `or` check here took much too long to run
  return predict([comment])

In [62]:
# df['contains_profanity'] = df.head(1000).apply(lambda row: detect_profanity(row['comment']), axis=1)
df['contains_profanity'] = predict(df['comment']) # one batched call is much faster than a row-wise apply

In [63]:
df_swear_comments = df[df['contains_profanity'] == 1]
df_swear_comments.shape

(76729, 7)

In [64]:
for comment in df_swear_comments.head(100)['comment']:
  print(comment)

hard as balls
He is a good man, and cares alot about his students. The only way you will do bad in this class is if you suck at school and life. Expect to do the homework, if you do this you will get an a
He is a good man, and cares alot about his students. The only way you will do bad in this class is if you suck at school and life.
He is a good man, cares alot about his students. The only way you will do bad in this class is if you suck at school and life.
Herb is the nicest man ever. However, he is not that great a teacher. He gives scratch quizzes and clicker quizzes but you need to teach yourself from the book. I suck at math, so this professor was not for me.
Evil man.
She failed me for no absolute reason. She is useless.
It is a easy calss,and book is so helpful you can get an A so easy from this class.but I hate shrestha he is a jerk
I don't know how people pass this course with Shrestha.  The book will not help you (I guarantee that), and Shrestha sucks at teaching.  Good Luck

In [68]:

df_swear_comments = df[df['comment'].str.contains("fuck|shit|asshole|dick|penis|cock|vagina|pussy|cunt|retard|bitch|slut|whore")] # no capture group, so str.contains does not warn about match groups
df_swear_comments.reset_index(drop=True, inplace=True)
print(df_swear_comments.shape)

  


(6370, 7)


In [69]:
df_swear_comments.iloc[15:20]

Unnamed: 0,comment_id,firstName,lastName,prof_class,comment,clarityRating,contains_profanity
15,UmF0aW5nLTEwNTY4MTY1,Barb,Stengel,EDFN111,she's insane.\r\n better than the other retards over at stayer.,4,1
16,UmF0aW5nLTM0NjM2ODY=,Barb,Stengel,EDFN211,"Excellent, ten times better than all the other Stayer retards",5,1
17,UmF0aW5nLTEyMzg3ODA3,Joseph,Caspar,MAT110,U know the scene from Saw where the guy cuts out his own eye&#63; It's kinda like that. If you g...,1,1
18,UmF0aW5nLTExNjQyMTc4,John,Kim,COMPLIT,"This guy is pretty cocky, but his intelligence is the key to getting a good grade! I did not buy...",5,0
19,UmF0aW5nLTEwOTU0MDA4,Bessma,Momani,HIST130,"I had her last year. She was a great prof. Funny, good lectures, puts notes on reserve at librar...",5,0


In [70]:
list_of_comments_with_swears = list(df_swear_comments['comment'])
for comment in list_of_comments_with_swears[15:35]:
  print(comment + '\n')

she's insane.
 better than the other retards over at stayer.

Excellent, ten times better than all the other Stayer retards

U know the scene from Saw where the guy cuts out his own eye&#63; It's kinda like that. If you get him...kill yourself. He's a living corpse who relies on ur stress for sustenance.  Every day he reminds the class they are basicly retarded, and yells at u if u don't understand the lesson.  If u can, kill someone so u can get their spot in another class.

This guy is pretty cocky, but his intelligence is the key to getting a good grade! I did not buy any of the books he told us to.. I just went to every lecture and took notes on his reviews. I got a decent grade... As for future students, buy the books and read the assignments.. you have to finish them in like a week.. not much time, but for an A, y

I had her last year. She was a great prof. Funny, good lectures, puts notes on reserve at library, and her test are good. If anyone doesn't like her i imagine she mad

In [18]:
df_swear_comments['clarityRating'].value_counts()

5    1695
1    1657
4    1077
2     999
3     942
Name: clarityRating, dtype: int64

In [25]:
possible_swears = df[df['comment'].str.contains('[a-zA-Z]+[%$*@!]{2,}|[%$*@!]{2,}[a-zA-Z]+')]
possible_swears.shape

(14502, 6)

In [26]:
for comment in possible_swears['comment'].head(100):
  print(comment)

When you still have 49 people fail after a 15 point curve you know you got ****ed.
Super hard class, the hardest in our major, and you gotta pass it. He may seem scary at first, but is actually very nice. Tests are ****es though
His lectures are boring, but necessary if you want to pass the tests. If you do the sample tests, the exams are not suprising and relatively easy. His HWs are abo****ely pointless and totally irrelevant to the course, not to mention mind-boggling hard. His TAs suck, and he's not much help either.
Monotonous. Jumped all over the place, making it diffcult to take notes. If she used overheads/slideshow or handouts it would make the class easier to follow. This is picky, but she presented many key facts incorrectly (Athens won the Pello****ian War&#63
Easy Take the Saturday fast track class. 6 classes, but you can't miss any days or you will miss the notes.  You can take this class and pass even if you are ****ed if you go every time and write down what he says wi

# HTML
In the reviews, characters are sometimes encoded as HTML entities in the format '&amp;quot;' or '&amp;#123;' (ironically, to properly show the raw representation I had to use an HTML entity myself).

These must be cleaned before tokenization or any further EDA steps.

Using regex to check against entities from https://www.freeformatter.com/html-entities.html

### HTML Entities

In [19]:
entities_broad = df[df['comment'].str.contains('&[0-9a-zA-Z#]+;')]
entities_broad.shape

(180605, 16)

In [20]:
# Assign the result back; replace(inplace=True) on a selected column may act on a copy
df['comment'] = df['comment'].replace(r'&([a-zA-Z]+|#\d+|[a-zA-Z0-9]{3,});', '', regex=True)
df['comment'] = df['comment'].replace(r'&#63;?', '', regex=True) # Specifically to deal with strange question mark remnants
# df_html_entities = df[df['comment'].str.contains(r'&(?:[a-zA-Z]+|#\d+);')]

In [21]:
entities_broad = df[df['comment'].str.contains('&[0-9a-zA-Z#]+;')]
entities_broad.shape

(1, 16)

In [22]:
entities_broad.reset_index(inplace=True, drop=True)
for comment in entities_broad['comment']:
  print(comment)

Great teacher. His tests are easy if you study his notes.  The midterm isn't written by him, and the final is a nationally written exam that covers chem 1&2; so be prepared for those to drop your grade.
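
Deleting entities also deletes the characters they stand for (`&#63;` is a question mark, `&quot;` a double quote). An alternative, sketched here as an assumption rather than what this notebook does, is to decode them with the standard library's `html.unescape`:

```python
import html

samples = [
    "I will be glad to get a &quot;C&quot; on her test.",
    "the guy cuts out his own eye&#63; It&#39;s kinda like that.",
]

for raw in samples:
    # html.unescape converts named and numeric character references to their characters
    print(html.unescape(raw))
```

Whether decoding or deleting is preferable depends on whether punctuation matters for the later tokenization step.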


### HTML Tags
Not many of these in the dataset.

In [23]:
comments_html_tags = df[df['comment'].str.contains('<.{1,6}?>')]
comments_html_tags.shape

(5, 16)

In [24]:
for comment in comments_html_tags['comment']:
  print(comment)

This class was alright; We had to read a novel and write an essay for every three chapters-it became very redundant and boring. I must say though, my writing greatly improved.<br> He's a nice guy.
I took his class in 1965.  It was a tag team teaching class with Jorgensen <sp>.  He was controversial then and I am glad  to see he hasn't changed.  You go Prof.
Mr. Wintz is the BEST! teacher I have ever had the privelege of taking. I will forever miss the lectures and the stories. Did I mention he loves chocolate? <grin>
Very clear and easy to understand. Notes are fill in the blank, and completed ones are provided, although after a delay. This class is not as hard as everyone makes it out to be as long as you <i> understand </i> the concepts, and not just how to get answers to specific problems. Work on the assignments without immediately looking at the solution.
Rehling is an amazing <b> teacher</b> who can come off as intimadating but is straight foraward, and sincere professor. IF you 
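
If these few tags should be removed, a small regex is probably enough. The sketch below is an assumption about how that could look: it only strips short, attribute-free tags such as `<br>` or `</i>`, and it also collapses whitespace, which would flatten any line breaks in the comment. A real HTML parser would be the safer choice for anything more complex.

```python
import re

# Short opening or closing tags without attributes, e.g. <br>, <i>, </b>, <grin>
SHORT_TAG_RE = re.compile(r'</?[a-zA-Z]{1,6}>')

def strip_short_tags(text):
    """Replace short tags with a space, then collapse runs of whitespace."""
    without_tags = SHORT_TAG_RE.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

print(strip_short_tags("my writing greatly improved.<br> He's a nice guy."))
# my writing greatly improved. He's a nice guy.
print(strip_short_tags("as long as you <i> understand </i> the concepts"))
# as long as you understand the concepts
```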

# Links, Phone-Numbers, Email-Addresses

In [25]:
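# Non-capturing groups (?:...) avoid the "pattern has match groups" warning from str.contains
link_inds = df['comment'].str.contains(r'\s*https?://\S+(?:\s+|$)')
# NOTE: the ^...$ anchors only match comments that consist entirely of a phone number
phone_inds = df['comment'].str.contains(r'^(?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$')
email_inds = df['comment'].str.contains(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')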
print("Number of links:", df[link_inds].shape)
print("Number of phone-numbers", df[phone_inds].shape)
print("Number of email-addresses", df[email_inds].shape)

Number of links: (156, 16)
Number of phone-numbers (0, 16)
Number of email-addresses (250, 16)
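
The zero phone-number matches are probably a consequence of the `^...$` anchors, which require the whole comment to be a phone number. A possible unanchored variant, sketched with the standard `re` module, uses lookarounds so that the match cannot start or end inside a longer digit run. The `EMBEDDED` pattern and the sample sentence (with a made-up number) are assumptions for illustration and have not been run on the dataset.

```python
import re

# Pattern from the cell above: matches only if the entire string is a phone number
ANCHORED = re.compile(r'^(?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$')
# Hypothetical variant: the number may appear anywhere, but not inside a longer digit run
EMBEDDED = re.compile(r'(?<!\d)(?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)')

sample = "Call his office at 555-123-4567 if you need help."

print(bool(ANCHORED.search(sample)))
# False
print(EMBEDDED.search(sample).group(0))
# 555-123-4567
```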


In [26]:
df.loc[link_inds, 'comment']

1122       Needs to be treated for Narcissistic Personality Disorder.\r\n Symptoms include:\r\n -Disregard ...
32136      Hated this class. It was so boring and his lectures in class do not even relate to the online te...
34864      http://wps.prenhall.com/esm_audesirk_bloe_7/0,8753,1139971-,00.html - This website could outteac...
37038      If you are taking the American Century course, beware! Its some kind of propaganda, since Corke ...
57339      Please avoid at all costs. Take with Math Dept. The material of the course is not all that diffi...
                                                          ...                                                 
3911623    He's a very nice guy. Tells jokes. And he has a very useful web site                            ...
3917853    I personally enjoyed this class. Yes, his lectures may be mono-toned and fast paced but he makes...
3922972    Please see http://www.ratemyprofessors.com/ShowRatings.jsptid=55852 for Professor Ferrer's ratin...
3

In [27]:
# df.loc[link_inds, 'comment'].replace(..., inplace=True) did not replace anything, most likely because .loc returns a copy; assign the result back instead
df['comment'] = df['comment'].replace(r'\s*https?://\S+(\s+|$)', ' ', regex=True)
df['comment'] = df['comment'].replace(r'^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$', ' ', regex=True)
df['comment'] = df['comment'].replace(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', ' ', regex=True)

In [28]:
print("Number of links:", df.loc[link_inds,'comment'].str.contains(r'\s*https?://\S+(?:\s+|$)').sum())
print("Number of phone-numbers:", df.loc[phone_inds,'comment'].str.contains(r'^(?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$').sum())
print("Number of email-addresses:", df.loc[email_inds,'comment'].str.contains(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+').sum())

Number of links: 0
Number of phone-numbers: 0
Number of email-addresses: 0

# Dates & Times

In [29]:
date_comments = df[df['comment'].str.contains(r'\d{1,2}/\d{1,2}/\d+|\d{1,4}/\d{1,4}/\d*')] # flawed: only catches slash-separated dates, not all possible date formats
date_comments.shape

(670, 16)

In [30]:
for comment in date_comments.reset_index(drop=True).loc[:10, 'comment']:
  print(comment+'\n')

Sen~or Cortiguera is a fair, fun and interesting teacher. I had him for SPA 101, 102and 201and must said That I learned a lot.  As for the negative comment of 8/28/07 SPA 102, I must clarify that I kind of know the person who wrote it and must said that this person cannot be trusted.

I agree with the reviewer from 12/12/06 about the quizzes and tests. They are awful and pick at tiny information that you'd have to memorize from 1 sentence out of the chapter or 2 that you read. They count for most of your points in the class which is awful for your grade. She wrote the book she uses in class but lectures are so dry and boring.

Please remove my previous comments and rating 10/15/10 HIST101, together with the present one, as the misunderstanding between me and Mr. Gilchrist has been solved, thank you.

I took Martin's class before he had his doctorate, and I learned sooo much. The idiot on 12/14/2008 that said it was an awful class has limited intelligence. Martin is awesome!

I agree wo

In [31]:
df['comment'] = df['comment'].replace(r'\d{1,2}/\d{1,2}/\d+|\d{1,4}/\d{1,4}/\d*', ' ', regex=True)

In [32]:
# -- Add code dealing with time --
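
One possible starting point for the time handling, offered only as a sketch: the pattern below is an assumption about which formats matter (clock times such as `8:30`, `8:30am`, `2 pm`) and has not been checked against the dataset. Forms like `a.m.` or `noon` are not covered.

```python
import re

# Hypothetical pattern: "8:30", "8:30am", "8:30 pm", "2pm", "2 pm" (case-insensitive)
TIME_RE = re.compile(r'\b\d{1,2}:\d{2}(?:\s?[ap]m)?\b|\b\d{1,2}\s?[ap]m\b', re.IGNORECASE)

sample = "Class starts at 8:30am, office hours are at 2 pm in room 101."

print(TIME_RE.findall(sample))
# ['8:30am', '2 pm']
```

The same pattern could then be passed to a regex replace on the comment column, in the same way as the date pattern above.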

# Numbers
- In near future, deal with the case where the number has commas

### Numbers greater than 100

In [33]:
comments_num = df[df['comment'].str.contains(r'\d+')]
comments_num.shape

(689923, 16)

In [34]:
# Get rid of numbers that are > 100 (assign back; replace(inplace=True) on a selected column may act on a copy)
df['comment'] = df['comment'].replace(r'\d{4,}', ' ', regex=True) # Remove numbers with 4 or more digits
df['comment'] = df['comment'].replace(r'[1-9][0-9][1-9]|[1-9][1-9][0-9]|[2-9]0{2}', ' ', regex=True) # Remove 3-digit numbers other than 100
# df['comment'].replace('0\d{2}', ' ', inplace=True, regex=True)

In [35]:
comments_num = df[df['comment'].str.contains(r'\d+')]
comments_num.shape

(578415, 16)

In [36]:
# Strange numbers that start with a 0
num_zero_start = df[df['comment'].str.contains(r'\D0\d{1,2}|^0\d+')]
print(num_zero_start.shape)
for comment in num_zero_start.reset_index(drop=True).loc[:10, 'comment']:
  print(comment)

(10583, 16)
In o92 & 096 I had straight A's and was enjoying math until I took Larry. I think he is one of the worse teacher's I have ever had! He is rude and very condesending to anyone who doesn't understand how he teaches. He goes to fast and doesn't explain anything clearly. DONT TAKE THIS TEACHER! HE IS A HORRIBLE TEACHER!
professor larry is AWESOME!  he explains everything so well and he really cares if you understand or not.  He'll give you homework, but he doesn't collect it, his tasts are pretty easy too, all stuff from the homework.  He's not really big on the book, so buy it used if you can.  I would reccomend him to anyone.  So bummed he's not teaching 096. :(
allowed one 096 class to use notes and correct a test but not the rest of his 096 classes.
VERY AWSOME TEACHER. im taking her now for rea001. she is the best! only way to fail is: if you dont do LAB assigments and if you dont care about the class. TAKEEE HERRR   oh and DO NOT click your pens lol.
summer 07 was petru's

In [37]:
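# Numbers that contain a comma
# NOTE: [\d,]{2,4} matches any run of 2-4 digits and/or commas (e.g. '10'), so the
# count below is an upper bound; a stricter pattern would be r'\d{1,3}(?:,\d{3})+'
num_contain_comma = df[df['comment'].str.contains(r'[\d,]{2,4}')]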
print(num_contain_comma.shape)

(266539, 16)


In [38]:
all_numbers = comments_num['comment'].str.extractall(r'(?P<number>\d+)')

In [39]:
print(all_numbers.shape)
all_numbers['number'].unique()

(956443, 1)


array(['2', '63', '40', '5', '3', '49', '15', '4', '6', '10', '1', '50',
       '100', '7', '20', '89', '8', '90', '92', '0', '17', '58', '14',
       '30', '75', '60', '70', '73', '85', '16', '12', '11', '95', '80',
       '93', '25', '096', '23', '18', '24', '19', '35', '34', '55', '9',
       '36', '42', '33', '001', '83', '64', '13', '87', '65', '91', '61',
       '52', '45', '26', '94', '07', '31', '01', '54', '27', '96', '97',
       '62', '05', '06', '57', '28', '22', '091', '093', '000', '21',
       '99', '98', '51', '37', '72', '74', '00', '68', '69', '08', '88',
       '82', '084', '085', '79', '44', '03', '098', '39', '061', '007',
       '53', '41', '81', '46', '77', '76', '86', '02', '47', '59', '04',
       '67', '66', '080', '023', '78', '095', '010', '09', '32', '84',
       '016', '43', '005', '29', '48', '38', '003', '71', '56', '020',
       '031', '032', '050', '030', '083', '082', '022', '015', '018',
       '099', '002', '017', '092', '070', '058', '060', '090', 

In [40]:
comments_with_perc = comments_num[comments_num['comment'].str.contains(r' %\d')] # Percent sign written before the number
comments_with_perc.shape

(264, 16)

In [41]:
for comment in comments_with_perc['comment']:
  print(comment)

Cute russian math professor. 3 tests, 7 quizzes, 1 final. a C is %65. Very nice to students. Can answer questions very clearly. Never at her office though. Excellent at drawing 3d shapes. Tests are identical to homework or lecture problems. Helps out on test problems. Teaches the course at the simplest level possible.
If you think your tough and can handle a crummy instructer...good for you. But don't take her class! She is horrible! She will take your test and puke red all over it, even if your test is a high %80, she will make sure you leave with nothing over a %60! Run as far as you can from this woman!
Summaries that you must do but are simple and gives you %100 if completed! no tests or quizzes just participate and do extra credit and you have a very high chance of getting an A! he cancelled class like 5 times so that was cool too! Class is boring for the most part though
Alot of freshmen had a hard time with him because its so easy to focus on the hundreds of pages of notes from 

# Symbols


In [42]:
print(df.shape)
df_symbol_comments = df[df['comment'].str.contains(r"[*#@%$=_&\^/<>]+")]
print(df_symbol_comments.shape)

(3917612, 16)
(360803, 16)


In [43]:
multiple_symbols = df_symbol_comments['comment'].str.extractall(r'(\S*[*#@%$=_&\^/<>]{2,}\S*)')
multiple_symbols.shape

(33443, 1)

In [44]:
unique_mult_symbols = multiple_symbols.drop_duplicates()
# Build the mask from the de-duplicated frame so its index matches the frame being filtered
unique_containing_percent = unique_mult_symbols[unique_mult_symbols[0].str.contains('%')]

  


### Ampersand (&)
Ampersands are very common as a substitute for "and".
More importantly, they also appear in university names (e.g. A&M) and even class names (e.g. A&P).



In [45]:
# First, collapse runs of consecutive &'s into a single &
df['comment'] = df['comment'].replace('&+', '&', regex=True)

In [46]:
df_amp_easy_comments = df[df['comment'].str.contains(" & ")][['comment']] # Easy case with the &'s, comments use it as substitute for 'and'
df_amp_easy_comments.shape

(88186, 1)

In [47]:
amp_contexts = df_amp_easy_comments['comment'].str.extract(r'(?P<context>.{1,19} & .{1,19})')
amp_contexts.reset_index(inplace=True, drop=True)

In [48]:
for context in amp_contexts.loc[:20, 'context']:
  print(context)

le to work with him &  study efficiently 
s is extremely nice & helpful in her offi
erman was very nice & helpful. I learned 
ll have you at ease & speaking espanol in
rs before the test, & know the study guid
class.2 online test & one final group pro
r in my night class & I've got to say...h
 them up. Open book & Open note exams, wi
ries to make it fun & keep you interested
ries to make it fun & keep you interested
gives great midterm & final reviews so ma
s gramatical errors & focuses less on con
In o92 & 096 I had straight 
ns. He's very blunt & to the point. He he
 Dr. Hixson (Journ  & Journ ) and loved g
 lost. Her lectures & materials on Blackb
s, take notes, read & its a fairly easy A
y. Read the chapter & u'll pass. Beware o
Circuits I & II Professor
other classes on BB & not had these probl
more than English   &  !


In [49]:
# Drop standalone ampersands (note: this removes the "and" instead of spelling it out)
df['comment'] = df['comment'].replace(' & ', ' ', regex=True)

In [50]:
df_amp_sub = df[df['comment'].str.contains(' & ')]
df_amp_sub.shape

(0, 16)

In [51]:
df_amp_med_comments = df[df['comment'].str.contains("[a-zA-Z]& | &[a-zA-Z]")][['comment']] # Tricky case where comments don't have proper spacing
df_amp_med_comments.shape

(2513, 1)

In [52]:
# Drop ampersands that are glued to only one of the neighbouring words
df['comment'] = df['comment'].replace('([a-zA-Z])& ', r'\1 ', regex=True)
df['comment'] = df['comment'].replace(' &([a-zA-Z])', r' \1', regex=True)

In [53]:
df_amp_med_comments = df[df['comment'].str.contains("[a-zA-Z]& | &[a-zA-Z]")][['comment']] # Re-check: no badly spaced ampersands should remain
df_amp_med_comments.shape

(0, 1)

In [54]:
common_amp_abbreviations = df['comment'].str.extract('(?P<abbrev>[a-zA-Z]&[a-zA-Z])')
common_amp_abbreviations.shape

(3917612, 1)

In [55]:
common_amp_abbreviations['abbrev'].str.lower().value_counts().head(50)

a&p    3813
q&a     298
w&m     120
a&b     106
a&m     106
m&f      98
e&m      92
m&m      85
w&l      80
s&t      79
e&e      77
f&m      75
t&f      65
d&c      64
b&w      58
e&h      55
s&q      50
s&a      48
s&s      46
s&e      44
t&t      42
i&i      42
s&p      40
s&c      35
s&b      34
s&d      32
d&d      31
s&h      31
s&m      30
c&t      29
m&a      29
c&i      29
s&l      29
e&s      28
r&w      27
s&i      26
y&p      26
t&c      26
e&c      25
e&f      25
s&n      25
e&t      24
s&f      23
a&c      23
t&p      23
r&r      23
d&f      22
e&i      22
w&j      22
r&d      22
Name: abbrev, dtype: int64

In [56]:
amp_contexts = df[df['comment'].str.contains(r'[a-zA-Z]&[a-zA-Z]')]

In [57]:
print(amp_contexts.shape)
amp_contexts.reset_index(inplace=True, drop=True)
for comment in amp_contexts.loc[:50, 'comment']:
  print(comment)

(8601, 16)
Its a hard class and well he dosent do much, the T&Q are easy if you do the homework. just do the home work like 2 or 3 times and you will pass the class
He doesn't teach the material and doesn't review for the exams he gives. You would be better off taking a different professor. I read the books and studied nonstop and still only scraped a C. I aced the lab portion of it, and that was the reason for the C. I aced microbiology and A&P before I even took his class. Find someone else.
I took this professor for Biology  . I looked up which professor to take for my A&P class and when his name popped up, it gave me a flash of PTSD. His lectures had absolutely nothing to do with the materials on the tests. I could have skipped the whole semester, read the textbook by myself and could have gotten the same grade as I received.
Do you really need to pass your a&p with a good grade after so much effort? Then do yourself a favor by not taking the class with him. Please, please and plea

### Percent (%)

In [58]:
comments_with_percent = df[df['comment'].str.contains('%')]
comments_with_percent.shape

(67398, 16)

In [59]:
percents_without_number = comments_with_percent[comments_with_percent['comment'].str.contains(r'[^\d]%[^\d]')] # Percent signs with no digit directly on either side
percents_without_number.shape

(3364, 16)

In [60]:
for comment in percents_without_number['comment']:
  print(comment)

She gives u the exam questions to study for,and you study hard, go to the exam hall , look at the exam questions, answer all of the questions THINKING that ull atleast get a 90%. You wait for 2 weeks until the exam marks come out and then you see you got a 70 some % or less. N THERE lies theCATCH Her INCOMPETENT T.A.S mark unfairly on purpose!
Class was 100 % online. The course was pretty much self paced (only deadlines, no meetings). The course used Canvas and McGraw-Hill Connect/SmartBook. Reading was understandable and exams were similar to quizzes. There was a 6 page book report and and an extra credit opportunity. This class was perfect paced for me (single mom, full time job).
She is a hard grader.  She likes to trick you in the exam. She likes to questions that are not worth to ask (eg. %, RDA).  She grades term paper really hard, I found a professor to correct the paper for me, but I got a C. She is not helpful to ESL students.  When she asks for volunteers, she always thinks s
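
A possible normalization for two recoverable forms seen in these outputs, a percent sign written before the number (`%65`) and one separated from the number by a space (`100 %`), is sketched below. The helper name `normalize_percent` and the replacement word `percent` are assumptions, not something the notebook already does. A bare `%` with no adjacent number (as in `eg. %, RDA` or `70 some %`) is left untouched.

```python
import re

def normalize_percent(text):
    """Hypothetical helper: rewrite '%65', '90%' and '100 %' as '<number> percent'."""
    # Percent sign written before the number: '%65' -> '65 percent'
    text = re.sub(r'%(\d+)', r'\1 percent', text)
    # Percent sign after the number, optionally separated by whitespace
    text = re.sub(r'(\d+)\s*%', r'\1 percent', text)
    return text

print(normalize_percent('a C is %65.'))
print(normalize_percent('Class was 100 % online.'))
```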

### Slashes (/ and \\)
- Used to shorten words (e.g. "w/" for "with").
- Used to write a fraction, which might be replaceable with "out of".
- Used to abbreviate "or".

Interestingly, both slashes are sometimes used in emoticons (:\ or :/).
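
The first two uses could be expanded along these lines. This is a rough sketch: the helper name `expand_slashes` and the short list of abbreviations (`w/`, `w/o`, `b/c`) are assumptions and would need to be extended after inspecting the data. The fraction rule will also misfire on expressions like `24/7`.

```python
import re

def expand_slashes(text):
    """Hypothetical helper: expand a few slash abbreviations and simple fractions."""
    # Handle 'w/o' before 'w/' so that 'without' is not split into 'with o'
    text = re.sub(r'\bw/o\b', 'without', text, flags=re.IGNORECASE)
    text = re.sub(r'\bb/c\b', 'because', text, flags=re.IGNORECASE)
    # 'w/' may be glued to the next word ('w/him'), so always emit one space after 'with'
    text = re.sub(r'\bw/\s*', 'with ', text, flags=re.IGNORECASE)
    # Simple fractions such as 9/10 -> '9 out of 10'
    text = re.sub(r'\b(\d+)/(\d+)\b', r'\1 out of \2', text)
    return text

print(expand_slashes('got a B+ b/c I worked w/ him'))
```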

In [61]:
regular_slashes = df[df['comment'].str.contains('/')]
back_slashes = df[df['comment'].str.contains(r'\\')]
print("Size of regular slashes:", regular_slashes.shape)
print("Size of back slashes:", back_slashes.shape)

Size of regular slashes: (157014, 16)
Size of back slashes: (39, 16)


In [62]:

for comment in back_slashes['comment']:
  print(comment)

Got a C+ in his class after studying like crazy for every test. Worked harder for this stupid wellness class than for all my others and still ended up bringing my gpa down. DONT take the 7:30 class. He is \ a stickler about attendance which is annoying. Not willing to meet with students outside of class and arrogant. Nice enough man but dont take
Nice guy really funny kind of Per\/  but its awesome. Great field trips but tests are hardcore way hard
Dr. Lamourelle's blackboard site is extremely unorganized. Her guidelines for writing essays, and projects are the same. She did respond to all of my emails and gave out her phone number to whoever needed it, but even her responses back were confusing. One time she didn't know where to turn in an assignment. :\. If you can keep away, KEEP AWAY!
Really good at explaining things. Seemed like a nice guy. Thought he did an all-around better job of things than Brenda Gunderson (who everyone else seems to recommend :\).
Most of the ES  students ar

In [63]:
# Print the first 25 comments that contain a forward slash
for comment in regular_slashes['comment'].head(25):
  print(comment)

This guy is awesome.  I totally failed my first quiz and still got a B+ in the class b/c I was able to work with him  study efficiently with his guidance. Go to review sessions! Ask lots of questions! He gives great / hilarious stories in class  as examples. You will work very hard! But you'll be rewarded in the end!
nice person, terrible instructor. she know barely anything about photographic processes/ digital imaging and honestly has no right accepting huge sums of tuition to do so. if you take her class you will learn nothing about the stated subject matter of the class. as a now working professional with hindsight my advice would be, avoid at all costs.
It's funny that people would say she's tough. If you attend class, skim the chaptrs before the test, know the study guides, you'll do fine.  She even has review days. She SAYS 1/2 the test q's come from the book, but as the semester wore on, most were from the lecture. Note: She's very serious about punishing cheaters and talks fem

# Contractions

# Observations
* There are duplicate comments, though they are easy to drop.
* Professor ID, university ID, and department were not scraped. **NEEDS TO BE DONE IN FUTURE SCRAPES**
* Clarity rating rounds down.
* Clarity ratings skew positive, then heavily negative (5, 4, then 1).
* Some comments are null, and some are effectively null, containing only the text "No Comments".
* Some reviews contain swear words, though they may not have been reviewed by an admin. **Grab this feature when scraping**
* Comments sometimes mention the student's grade, sometimes enclosed in quotes. This isn't accounted for right now, but should be in the future.
* Class names are abbreviated, so it will be hard to recover the true class.
* Emojis are filtered out, but how should emoticons like :) be handled?
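
For the emoticon question, one option is to replace known emoticons with placeholder tokens so that later symbol stripping does not destroy them. The mapping and token names below are hypothetical, and the list of emoticons is far from complete. Note that the helper also collapses repeated whitespace.

```python
import re

# Hypothetical mapping from emoticons to placeholder tokens; token names are arbitrary.
EMOTICON_TOKENS = {
    ':)': ' emoticon_happy ',
    ':-)': ' emoticon_happy ',
    ':(': ' emoticon_sad ',
    ':-(': ' emoticon_sad ',
    ':/': ' emoticon_unsure ',
    ':\\': ' emoticon_unsure ',
}

# Try longer emoticons first; the (?!/) lookahead avoids matching the ':/' in 'http://'.
_EMOTICON_PATTERN = re.compile(
    '(?:'
    + '|'.join(re.escape(e) for e in sorted(EMOTICON_TOKENS, key=len, reverse=True))
    + ')(?!/)'
)

def tag_emoticons(text):
    """Replace known emoticons with tokens, then collapse the extra whitespace."""
    tagged = _EMOTICON_PATTERN.sub(lambda m: EMOTICON_TOKENS[m.group(0)], text)
    return re.sub(r'\s+', ' ', tagged).strip()

print(tag_emoticons("So bummed he's not teaching 096. :("))
```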

# TODO


* Find a way to identify reviews containing words that our models don't know (slang words)
* Decide how to deal with numbers (0-9) and normalize them; might need to ignore them completely for now
* Abbreviations like w/ need to be expanded
* ~~**Deal with HTML entities, either by parsing them or removing them**~~
* Common abbreviations using & need to be dealt with (slightly challenging, because the & may sit in an abbreviation like q&a, meaning "question and answer", or in a university name like A&M)
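
For the last item, one possible approach is to expand the few abbreviations whose meaning is known and turn the rest into single tokens so they survive later symbol removal. The lookup table below is hypothetical, and the expansion given for `a&p` is an assumption based on how it is used in the comments. The pattern only covers single-letter pairs such as `q&a`, not longer words joined by `&`.

```python
import re

# Hypothetical expansions; the meaning assigned to each key is an assumption.
AMP_EXPANSIONS = {
    'q&a': 'question and answer',
    'a&p': 'anatomy and physiology',
}

# Single letters joined by '&', e.g. Q&A, A&M, T&Q
_AMP_PATTERN = re.compile(r'\b([A-Za-z])&([A-Za-z])\b')

def protect_amp_abbreviations(text):
    """Expand known abbreviations; rewrite unknown ones as tokens like 'a_and_m'."""
    def _replace(match):
        key = match.group(0).lower()
        if key in AMP_EXPANSIONS:
            return AMP_EXPANSIONS[key]
        return f'{match.group(1)}_and_{match.group(2)}'.lower()
    return _AMP_PATTERN.sub(_replace, text)

print(protect_amp_abbreviations('Texas A&M has a Q&A'))
```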