# Rate my Professor Project
This is Part 1 of a comprehensive RateMyProfessor project aimed at building an overview of faculty and professor ratings at Cal Poly Pomona. The goal is to provide students with accessible and aggregated insights into how professors are rated on the RateMyProfessor platform.

In this phase of the project, we focus on basic data cleaning and exploratory data analysis (EDA) to prepare the dataset for deeper analysis in future stages.

The data was collected using a custom web scraping bot built with Selenium, designed to extract information on both current and former professors at Cal Poly Pomona.

The dataset below was scraped from RateMyProfessor and contains information on faculty from Cal Poly Pomona. Each row represents a professor, along with their aggregated rating and departmental data.

Columns Explained:

Professor_ID: A unique identifier assigned to each professor.

Professor_Name: The full name of the professor.

Department: The academic department the professor belongs to.

Avg_Rating: The average rating the professor has received (typically on a 1–5 scale).

Total_Ratings: The total number of ratings or reviews the professor has received.

Would_Take_Again: The percentage of students who indicated they would take the professor again.

Avg_Difficulty: The average difficulty rating given by students for the professor’s courses.

This structured dataset serves as the foundation for exploring professor performance, student satisfaction, and department-level insights at Cal Poly Pomona.

In [112]:
# Importing necessary libraries 
import pandas as pd

In [118]:
# Read the raw CSV without any initial parsing or cleaning
df = pd.read_csv(r'\Users\Ivan\Downloads\Ratemyprofesser_dataclean\ratemyprofessors.csv')
df

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty
0,2073648,3.2,96,Mark Okuhata,History,49%,3.5
1,2676805,4.5,74,Melody Adejare,Communication,88%,3.3
2,54640,4.7,68,Jill Nemiro,Psychology,90%,2.5
3,2147335,3.0,35,Robert Blumenfeld,Psychology,43%,3.3
4,1087541,3.8,26,Juliana Fuqua,Psychology,75%,2.8
...,...,...,...,...,...,...,...
2559,660753,4.9,4,Paul Salomaa,Mathematics,,2.8
2560,1294559,5.0,1,Shelly Mendez,Mathematics,,2.0
2561,913059,2.3,2,Robert W. Small,Hospitality,,5.0
2562,674092,3.5,3,Wayne Wooden,Sociology,,3.0


# Data Cleaning 

This section focuses on ensuring consistency, handling missing values, and standardizing formats.

In [116]:
df.head(10)

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty
0,2073648,3.2,96,Mark Okuhata,History,49%,3.5
1,2676805,4.5,74,Melody Adejare,Communication,88%,3.3
2,54640,4.7,68,Jill Nemiro,Psychology,90%,2.5
3,2147335,3.0,35,Robert Blumenfeld,Psychology,43%,3.3
4,1087541,3.8,26,Juliana Fuqua,Psychology,75%,2.8
5,2652433,5.0,19,Sarah Huff,Music,100%,1.5
6,2630928,4.8,5,Tatiana Pumaccahua,Psychology,100%,1.4
7,2002633,4.9,44,Steven Camacho,English,100%,2.6
8,2171196,3.3,27,Suresh Ganapathy,Engineering,89%,2.8
9,3073381,2.3,4,Kora Tsay,Mathematics,25%,4.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2564 entries, 0 to 2563
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Professor_ID      2564 non-null   int64  
 1   Avg_Rating        2564 non-null   float64
 2   Total_Ratings     2564 non-null   int64  
 3   Professor_Name    2564 non-null   object 
 4   Department        2564 non-null   object 
 5   Would_Take_Again  2242 non-null   object 
 6   Avg_Difficulty    2564 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 140.3+ KB


After examining the data, everything looks correct except that Would_Take_Again is stored as an object (string) column instead of float64.

In [59]:
# Drop any duplicates within the df
df.drop_duplicates(inplace=True)

In [63]:
df

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty
0,2073648,3.2,96,Mark Okuhata,History,49.0,3.5
1,2676805,4.5,74,Melody Adejare,Communication,88.0,3.3
2,54640,4.7,68,Jill Nemiro,Psychology,90.0,2.5
3,2147335,3.0,35,Robert Blumenfeld,Psychology,43.0,3.3
4,1087541,3.8,26,Juliana Fuqua,Psychology,75.0,2.8
...,...,...,...,...,...,...,...
2559,660753,4.9,4,Paul Salomaa,Mathematics,,2.8
2560,1294559,5.0,1,Shelly Mendez,Mathematics,,2.0
2561,913059,2.3,2,Robert W. Small,Hospitality,,5.0
2562,674092,3.5,3,Wayne Wooden,Sociology,,3.0


In [11]:
# Checking whether any column contains NaN/null values
df.isna().any()

Professor_ID        False
Avg_Rating          False
Total_Ratings       False
Professor_Name      False
Department          False
Would_Take_Again     True
Avg_Difficulty      False
dtype: bool

In [13]:
# Counting the NaN/null values in each column
df.isna().sum()

Professor_ID          0
Avg_Rating            0
Total_Ratings         0
Professor_Name        0
Department            0
Would_Take_Again    322
Avg_Difficulty        0
dtype: int64

In [15]:
# Checking a specific row
df.iloc[2559]

Professor_ID              660753
Avg_Rating                   4.9
Total_Ratings                  4
Professor_Name      Paul Salomaa
Department           Mathematics
Would_Take_Again             NaN
Avg_Difficulty               2.8
Name: 2559, dtype: object

In [136]:
# Replacing the NaNs with 0
df.fillna(0, inplace=True)

In [138]:
# Double Checking
df.isna().sum()

Professor_ID        0
Avg_Rating          0
Total_Ratings       0
Professor_Name      0
Department          0
Would_Take_Again    0
Avg_Difficulty      0
dtype: int64
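
Filling with 0 makes a missing percentage look the same as a real 0% score, and the column does contain genuine '0%' values. One possible safeguard, sketched here on a hypothetical three-row sample, is to record which rows had a value before anything is filled in. The column name `Has_Would_Take_Again` and the sample rows are assumptions made for illustration and are not part of the scraped dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped data (names are placeholders)
sample = pd.DataFrame({
    "Professor_Name": ["A One", "B Two", "C Three"],
    "Would_Take_Again": ["49%", None, "0%"],
})

# Flag rows that actually carry a value, before any fill is applied
# (Has_Would_Take_Again is an illustrative column name)
sample["Has_Would_Take_Again"] = sample["Would_Take_Again"].notna()

# Count the genuinely missing percentages; real 0% scores are not counted
missing_count = int((~sample["Has_Would_Take_Again"]).sum())
print(missing_count)
```

With such a flag in place, later averages can be restricted to rows where the flag is True, so filled zeros do not pull the results down.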

# Handling Missing Values & Validate Column Ranges

Identify and address missing or incorrectly formatted values within the DataFrame, and verify that each column satisfies its constraints.

## Given what we know about the columns:
Professor_ID: Should be unique or at least consistent in format.

Avg_Rating: Must be between 0 and 5.

Total_Ratings: Cannot contain negative values.

Professor_Name: Should be consistently formatted (e.g., First Last), with no extra spaces or all-caps issues.

Would_Take_Again: Should not exceed 100 (interpreted as a percentage).

Avg_Difficulty: Should not exceed 5.
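
The numeric constraints above can also be checked in one pass. The following is a minimal sketch, assuming the column names used in this notebook; the helper `check_constraints` and the sample rows are hypothetical and only illustrate the idea. Would_Take_Again is left out because it is still stored as text at this point.

```python
import pandas as pd

def check_constraints(frame: pd.DataFrame) -> dict:
    """Count the rows that violate each documented numeric constraint."""
    return {
        # IDs should be unique: count repeated occurrences
        "duplicate_ids": int(frame["Professor_ID"].duplicated().sum()),
        # Ratings must lie between 0 and 5 (inclusive)
        "rating_out_of_range": int((~frame["Avg_Rating"].between(0, 5)).sum()),
        # Rating counts cannot be negative
        "negative_totals": int((frame["Total_Ratings"] < 0).sum()),
        # Difficulty should not exceed 5
        "difficulty_above_5": int((frame["Avg_Difficulty"] > 5).sum()),
    }

# Hypothetical rows with deliberate violations
sample = pd.DataFrame({
    "Professor_ID": [1, 2, 2],
    "Avg_Rating": [4.5, 5.2, 3.0],
    "Total_Ratings": [10, -1, 3],
    "Avg_Difficulty": [2.0, 3.5, 5.0],
})
report = check_constraints(sample)
print(report)
```

A report of all zeros would indicate that every checked constraint holds; any non-zero entry points to the column that needs a closer look.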

In [17]:
# Identify the unique values within the column to ensure there are no unusual values or expressions
df['Professor_ID'].unique()

array([2073648, 2676805,   54640, ...,  913059,  674092, 1166221],
      dtype=int64)

In [19]:
# Checking whether any value within the column exceeds the constraint limit of 5
df['Avg_Rating'].unique() <= 5

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [21]:
# Additional confirmation
(df['Avg_Rating'].unique() <= 5).all()

True

In [23]:
# Identify the unique values within the column to ensure there are no unusual values or expressions
df['Total_Ratings'].unique()

array([ 96,  74,  68,  35,  26,  19,   5,  44,  27,   4,  31,  28,  25,
        16,  15,   7,  14,  10,   2,  38,  36,  22,   1, 619,  58,  61,
        49,   6,   9,   8,  51,  39,  34,  12,  45, 122,  47,  20,  17,
        18,  63,  40,  37,  29,  23,  21,  13,  67,  62, 103,  79, 138,
       101,  54,  78,  41,   3,  33,  50,  55,  30,  53,  32,  56,  43,
        42,  24,  70,  72,  48,  11,  57,  46,  93,  76,  69,  87,  73,
        59,  52,  91,  81,  80,  71,  77,  83,  92,  60, 253, 118],
      dtype=int64)

In [25]:
# Identify the unique values within the column to ensure there are no unusual values or expressions
df["Professor_Name"].unique()

array(['Mark Okuhata', 'Melody Adejare', 'Jill Nemiro', ...,
       'Robert W. Small', 'Wayne Wooden', 'Shereef Ellaboudy'],
      dtype=object)

In [27]:
# Checking how often each professor name appears
df["Professor_Name"].value_counts()

Professor_Name
David Juranich           2
Pinar Tremblay           2
Barbara Gill-Mayberry    2
Dolores Arredondo        2
Shu Shang                2
                        ..
Claudia Wainer           1
Candice Huynh            1
Jeff Burke               1
Jaehoon Seong            1
Shereef Ellaboudy        1
Name: count, Length: 2537, dtype: int64

In [29]:
# Listing the professors whose names appear more than once, possibly in different departments
df[df["Professor_Name"].duplicated(keep =False)].sort_values(by="Professor_Name")

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty
98,698145,3.7,78,Alane Daugherty,Health Science,60%,1.8
260,2507129,4.2,5,Alane Daugherty,Physical Education,80%,1.8
1823,1122840,3.1,6,Anthony Vercillo,Marketing,25%,3.0
1787,2114616,3.7,6,Anthony Vercillo,Business,20%,2.5
35,411553,1.2,61,Barbara Gill-Mayberry,English,3%,4.4
1162,1633109,1.0,3,Barbara Gill-Mayberry,English & Languages,0%,4.7
1963,706464,3.6,26,Brian Johnson,Psychology,56%,3.3
727,3039961,5.0,1,Brian Johnson,Mathematics,100%,3.0
1472,1879822,2.0,2,Callie Burnley,Management,0%,3.5
669,1724086,2.7,24,Callie Burnley,Business,32%,2.9


It appears that some professors are listed under multiple departments. As long as these departments are distinct and not sub-groups within the same larger department, no immediate correction is needed.
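
To see which names span more than one department, the listings can be summarised per professor. This is a small sketch on three rows taken from the outputs above; the variable names are illustrative. Note that a pair such as 'English' and 'English & Languages' would also count as two departments until the department names are standardized.

```python
import pandas as pd

# Three rows copied from the outputs above
sample = pd.DataFrame({
    "Professor_Name": ["Alane Daugherty", "Alane Daugherty", "Jill Nemiro"],
    "Department": ["Health Science", "Physical Education", "Psychology"],
})

# Number of distinct departments recorded for each name
dept_counts = sample.groupby("Professor_Name")["Department"].nunique()

# Names that appear under more than one department
multi_listed = dept_counts[dept_counts > 1].index.tolist()

# The matching rows, for manual review
rows = sample[sample["Professor_Name"].isin(multi_listed)]
print(rows)
```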

However, to ensure consistency, let’s also check if any professor names contain a middle name that may affect grouping or matching.

In [110]:
# Show professors whose names contain a middle name (more than two words)
middle_names = df[df['Professor_Name'].apply(lambda name: len(str(name).split()) > 2)]

# Display the result
middle_names

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty


In [120]:
# Keep only the first and last parts of the name (remove middle names/initials)
df['Professor_Name'] = df['Professor_Name'].apply(
    lambda name: ' '.join(str(name).split()[::len(str(name).split())-1]) 
    if len(str(name).split()) > 2 else name)
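
The slicing expression above keeps the first and last tokens of a name. The same logic may be easier to read as a named helper, as in the sketch below; `first_and_last` is a hypothetical function name, not something defined elsewhere in this notebook. One caveat of this approach is that a multi-word surname would be cut down to its final word.

```python
import pandas as pd

def first_and_last(name: str) -> str:
    """Keep only the first and last whitespace-separated parts of a name."""
    parts = str(name).split()
    if len(parts) > 2:
        return f"{parts[0]} {parts[-1]}"
    # Names with one or two parts are returned unchanged
    return name

names = pd.Series(["Robert W. Small", "Jill Nemiro", "Barbara Gill-Mayberry"])
cleaned = names.apply(first_and_last)
print(cleaned.tolist())
```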

In [126]:
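# Let's double check by recomputing the filter on the updated column
middle_names = df[df['Professor_Name'].apply(lambda name: len(str(name).split()) > 2)]
print(middle_names)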

Empty DataFrame
Columns: [Professor_ID, Avg_Rating, Total_Ratings, Professor_Name, Department, Would_Take_Again, Avg_Difficulty]
Index: []


We've successfully standardized the Professor_Name column. Now, let's move on to cleaning the remaining columns.

In [68]:
# Identify the unique values within the column to ensure there are no unusual values or expressions
df["Department"].unique()

array(['History', 'Communication', 'Psychology', 'Music', 'English',
       'Engineering', 'Mathematics', 'Apparel Merchandising', 'Biology',
       'Languages', 'Science', 'Political Science', 'Agriculture',
       'Electrical Engineering', 'International Business',
       'Animal Science', 'Medicine', 'Business', 'Philosophy', 'Physics',
       'Management', 'Literature', 'Accounting', 'Computer Science',
       'Urban Regional Planning', 'Architecture', 'Graphic Arts',
       'Marketing', 'Computer Information Systems', 'Chemistry',
       'Economics', 'Finance', 'Health Science', 'Geography',
       'Art History', 'Theater', 'Ethnic Studies', 'Social Work',
       'Anthropology', 'Culinary Arts', 'Landscape Architecture', 'Law',
       'Physical Ed', 'Film', 'Education', 'General Ed', 'Hospitality',
       'Sociology', 'Geology', 'Kinesiology', 'Criminal Justice',
       'Technology Operations Mgmt', 'Veterinary Sciences', 'Design',
       'Fine Arts', 'Mechanical Engineering', 'Co

Within the Department column, we notice several inconsistencies and variations in naming that refer to the same or similar departments. These variations can lead to incorrect grouping or fragmented analysis if left uncorrected. Some examples include:

English & Languages vs. English

Information Science vs. Computer Information Systems

ScienceEngineering vs. Science/Engineering

Let's correct this by creating a mapping dictionary that defines the standard department name for each variation, which ensures accurate grouping, filtering, and analysis by department.


In [34]:
# Let's create a dictionary to replace the department names
department_dict = {
    'English & Languages': 'English',
    'Science/Engineering': 'Science',
    'Information Science': 'Computer Information Systems',
    'Anthropology & Geo Sciences': 'Anthropology',
    'ScienceEngineering': 'Science',
    'Physical Education': 'Physical Ed',
    'International Bus. & Marketing': 'International Business',
    'Electrical Engineering & Computer Science': 'Electrical Engineering',
    'Engineering & Computer Science': 'Computer Engineering',
    'Landscape Architecture & Regional Planning': 'Landscape Architecture',
    'Technology & Operations Mgmt' : 'Technology Operations Mgmt',
    'Urban & Regional Planning' : 'Urban Regional Planning',
    'Urban Design & Development' : 'Urban Design',
    'Interdisciplinary General Ed.' : 'General Ed',
    "Women's Studies": 'Gender Studies', 
    "Foods & Nutrition": 'Nutrition',
}    

In [36]:
# Replace the values within the column using department_dict, then double check that the mapping worked
df['Department'] = df['Department'].replace(department_dict)
df["Department"].unique()

array(['History', 'Communication', 'Psychology', 'Music', 'English',
       'Engineering', 'Mathematics', 'Apparel Merchandising', 'Biology',
       'Languages', 'Science', 'Political Science', 'Agriculture',
       'Electrical Engineering', 'International Business',
       'Animal Science', 'Medicine', 'Business', 'Philosophy', 'Physics',
       'Management', 'Literature', 'Accounting', 'Computer Science',
       'Urban Regional Planning', 'Architecture', 'Graphic Arts',
       'Marketing', 'Computer Information Systems', 'Chemistry',
       'Economics', 'Finance', 'Health Science', 'Geography',
       'Art History', 'Theater', 'Ethnic Studies', 'Social Work',
       'Anthropology', 'Culinary Arts', 'Landscape Architecture', 'Law',
       'Physical Ed', 'Film', 'Education', 'General Ed', 'Hospitality',
       'Sociology', 'Geology', 'Kinesiology', 'Criminal Justice',
       'Technology Operations Mgmt', 'Veterinary Sciences', 'Design',
       'Fine Arts', 'Mechanical Engineering', 'Co

It looks like we’ve successfully removed unusual formatting and cleaned up the inconsistencies within the Department column. We’ve grouped and standardized department names by applying a mapping dictionary, ensuring all similar departments are now labeled consistently.
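
Because the printed array is long, a programmatic check can confirm that none of the old variants remain. The sketch below uses a two-entry subset of the mapping and a hypothetical four-value column; the variable names are illustrative.

```python
import pandas as pd

# Subset of the mapping defined earlier, kept short for the example
department_dict = {
    "English & Languages": "English",
    "Physical Education": "Physical Ed",
}

# Hypothetical column values, including two variants that should be mapped
departments = pd.Series(
    ["English", "English & Languages", "Physical Education", "History"]
)
cleaned = departments.replace(department_dict)

# Any mapping key still present after the replacement indicates a missed variant
leftover = sorted(set(cleaned) & set(department_dict))
print(leftover)
```

An empty list means every variant listed in the dictionary was replaced. It does not detect variants that were never added to the dictionary, so a visual scan of the unique values is still worthwhile.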

In [38]:
# Identify the unique values within the column to ensure there are no unusual values or expressions
df['Would_Take_Again'].unique()

array(['49%', '88%', '90%', '43%', '75%', '100%', '89%', '25%', '36%',
       '97%', '32%', '96%', '54%', '84%', '93%', '50%', '87%', '23%',
       '72%', '94%', '40%', '82%', '66%', '3%', '67%', '48%', '44%',
       '56%', '22%', '42%', '20%', '39%', '14%', '0%', '98%', '31%',
       '53%', '60%', '80%', '38%', '74%', '64%', '68%', '24%', '76%',
       '86%', '77%', '34%', '35%', '71%', '27%', '92%', '13%', '62%',
       '70%', '78%', '30%', '65%', '16%', '52%', '91%', '85%', '15%',
       '17%', '7%', '26%', '5%', '29%', '12%', '57%', '69%', '19%', '46%',
       '11%', '4%', '45%', '79%', '37%', '95%', '18%', '81%', '41%',
       '73%', '83%', '58%', '59%', '55%', '63%', '33%', '8%', '47%',
       '61%', '10%', '6%', '99%', '9%', nan, '28%'], dtype=object)

The Would_Take_Again column needs to be converted to a floating-point type before we can perform numerical calculations on it. However, because of the "%" symbol, converting the values directly would raise an error: strings containing non-numeric characters cannot be cast to float.

In [40]:
# Remove the "%" symbol from the column to prepare it for numeric conversion
df['Would_Take_Again'] = df['Would_Take_Again'].str.replace('%', '', regex=False)

In [42]:
# Double Checking
df['Would_Take_Again']

0       49 
1       88 
2       90 
3       43 
4       75 
       ... 
2559    NaN
2560    NaN
2561    NaN
2562    NaN
2563    NaN
Name: Would_Take_Again, Length: 2564, dtype: object

In [44]:
# Conversion to float 
df['Would_Take_Again'] = df['Would_Take_Again'].astype(float)
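
As an alternative to the two-step approach above, the symbol removal and the conversion can be combined with `pd.to_numeric`. This is a sketch on a hypothetical four-value sample rather than the notebook's actual method; `errors="coerce"` turns anything that still cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical sample with one missing value
raw = pd.Series(["49%", "100%", None, "0%"])

# Strip the trailing "%" and convert; unparseable entries become NaN
as_float = pd.to_numeric(raw.str.rstrip("%"), errors="coerce")

# Percentages should stay within 0 to 100 once missing values are set aside
in_range = bool(as_float.dropna().between(0, 100).all())
print(as_float.tolist(), in_range)
```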

In [46]:
#Double checking
df.head(10)

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty
0,2073648,3.2,96,Mark Okuhata,History,49.0,3.5
1,2676805,4.5,74,Melody Adejare,Communication,88.0,3.3
2,54640,4.7,68,Jill Nemiro,Psychology,90.0,2.5
3,2147335,3.0,35,Robert Blumenfeld,Psychology,43.0,3.3
4,1087541,3.8,26,Juliana Fuqua,Psychology,75.0,2.8
5,2652433,5.0,19,Sarah Huff,Music,100.0,1.5
6,2630928,4.8,5,Tatiana Pumaccahua,Psychology,100.0,1.4
7,2002633,4.9,44,Steven Camacho,English,100.0,2.6
8,2171196,3.3,27,Suresh Ganapathy,Engineering,89.0,2.8
9,3073381,2.3,4,Kora Tsay,Mathematics,25.0,4.0


In [48]:
# Identify the unique values within the column to ensure there are no unusual values or expressions
df['Avg_Difficulty'].unique() 

array([3.5, 3.3, 2.5, 2.8, 1.5, 1.4, 2.6, 4. , 2.4, 3.6, 3.1, 3.4, 3. ,
       2.1, 1.8, 2.7, 1.3, 2. , 4.1, 4.4, 3.8, 2.3, 3.7, 4.2, 2.2, 4.6,
       3.2, 3.9, 2.9, 4.3, 5. , 1.7, 4.5, 1.9, 4.9, 1.6, 1. , 1.2, 1.1,
       4.8, 4.7])

In [72]:
# Checking whether any value within the column exceeds the constraint limit of 5
(df['Avg_Difficulty'].unique() <= 5).all()

True

In [52]:
# Final Check to ensure that the data is consistent 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2564 entries, 0 to 2563
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Professor_ID      2564 non-null   int64  
 1   Avg_Rating        2564 non-null   float64
 2   Total_Ratings     2564 non-null   int64  
 3   Professor_Name    2564 non-null   object 
 4   Department        2564 non-null   object 
 5   Would_Take_Again  2242 non-null   float64
 6   Avg_Difficulty    2564 non-null   float64
dtypes: float64(3), int64(2), object(2)
memory usage: 140.3+ KB


In [138]:
# Display the DataFrame in random order by sampling 100% of the rows (the result is not assigned, so df is unchanged)
df.sample(frac = 1)

Unnamed: 0,Professor_ID,Avg_Rating,Total_Ratings,Professor_Name,Department,Would_Take_Again,Avg_Difficulty
165,2249590,1.9,16,Wendy Dixon,Biology,19%,3.3
1519,1425423,2.8,12,Meihua Koo,Accounting,34%,4.1
1387,2836204,5.0,1,Rebecca Valbuena,Education,100%,3.0
2180,861523,2.5,4,Anwar Salimi,Accounting,100%,1.5
828,3037156,5.0,2,Nicole Reynolds,Biology,100%,3.0
...,...,...,...,...,...,...,...
2464,809077,4.8,10,Richard Burky,Geography,,1.6
1961,2540590,4.0,1,Brenda Ramirez,Biology,100%,1.0
1125,2427208,2.5,10,Isabel Bustamante,Languages,70%,2.5
2359,802211,3.8,4,Doug Spoon,Journalism,,3.8


# Finalization of Cleaned Dataframe

In [None]:
# Saving the Cleaned Dataframe
df.to_csv('clean_ratemyprofessors.csv',index=False)
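
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra index column. The round trip can be sketched with an in-memory buffer instead of a real file; the one-row sample and variable names are illustrative.

```python
import io
import pandas as pd

# One-row sample using values from the outputs above
sample = pd.DataFrame({"Professor_ID": [54640], "Avg_Rating": [4.7]})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```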

The dataset is ready.