### Module 4: Introduction to NumPy, Pandas, and Matplotlib

#### Case Study–2

#### Business challenge/requirement
You are a data analyst at the University of Cal, USA (not a machine learning expert
yet, as you have not completed the ML with Python course :-)). The University
has data on the Math, Physics, and Data Structure scores of sophomore students, stored
in separate files. The University has hired a data science company to analyze the
scores and find whether they correlate with age, ethnicity, and other attributes.
Before the data is handed to the company, you have to do the data wrangling.

Key issues
Ensure that students' identities are not revealed to the agency and that only
relevant data is shared.

Data volume
- In the thousands, but only around 1800 records are shared, in the files
MathScoreTerm1.csv, DSScoreTerm1.csv, and PhysicsScoreTerm1.csv

Business benefits
The University can increase student enrollment by improving its international
ranking through personalized courses/curricula for students.

Approach to Solve
You have to use the fundamentals of NumPy and Pandas covered in Module 4.
1. Read the three CSV files, which contain the scores of the same students in term 1 of each subject
2. Remove the name and ethnicity columns (to ensure confidentiality)
3. Fill missing score data with zero
4. Merge the three files
5. Change the Sex (M/F) column to 1/2 for further analysis
6. Store the data in a new file, ScoreFinal.csv

Enhancements for code
You can try these enhancements in the code
1. Convert ethnicity to a numerical value
2. Fill a student's missing score with the class average


In [2]:
# Step 1: Read the three CSV files which contain the score of the same students in term1 of each Subject

import pandas as pd

math_df = pd.read_csv("MathScoreTerm1.csv")
ds_df = pd.read_csv("DSScoreTerm1.csv")
physics_df = pd.read_csv("PhysicsScoreTerm1.csv")

In [3]:
# Step 2 : Remove the name and ethnicity column (to ensure confidentiality)

confidential_cols = ["Name", "Ethnicity"]

# errors="ignore" skips any listed column that a given file does not contain
math_df = math_df.drop(columns=confidential_cols, errors="ignore")
ds_df = ds_df.drop(columns=confidential_cols, errors="ignore")
physics_df = physics_df.drop(columns=confidential_cols, errors="ignore")


In [4]:
#Step 3 - Fill missing score data with zero.

math_df = math_df.fillna(0)
ds_df = ds_df.fillna(0)
physics_df = physics_df.fillna(0)
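
Calling `fillna(0)` on a whole DataFrame fills every column, not only the scores. If the files also carry columns such as Sex or Age, a missing value there would silently become 0 as well. The sketch below shows one way to restrict the fill to score columns; the toy frame and its column names (`ID`, `Sex`, `MathScore`) are assumptions for illustration and may not match the real files.

```python
import pandas as pd

# Toy frame standing in for one score file (column names are assumed).
toy_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Sex": ["M", None, "F"],
    "MathScore": [78.0, None, 91.0],
})

# Passing a dict to fillna fills only the named columns;
# the missing Sex value is left untouched.
filled_df = toy_df.fillna({"MathScore": 0})
print(filled_df)
```

Keeping non-score gaps as missing makes them easier to spot later than a 0 that looks like real data.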

In [9]:
#Step 4 - Merge the three files on StudentID (ID)

merged_df = math_df.merge(ds_df, on="ID").merge(physics_df, on="ID")
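
Two details of `merge` are worth checking here. First, the default join is an inner join, so a student missing from any one file disappears from the result. Second, if the files share columns other than `ID` (for example Sex), merging on `ID` alone keeps both copies with `_x`/`_y` suffixes. The sketch below illustrates both on toy data; the frames and their column names are assumptions, not the contents of the real files.

```python
import pandas as pd

# Toy frames standing in for two of the score files (column names are assumed).
math_toy = pd.DataFrame(
    {"ID": [1, 2, 3], "Sex": ["M", "F", "M"], "MathScore": [70, 85, 90]}
)
ds_toy = pd.DataFrame(
    {"ID": [1, 2, 4], "Sex": ["M", "F", "F"], "DSScore": [65, 88, 72]}
)

# Merging on ID alone keeps both Sex columns, renamed Sex_x and Sex_y.
on_id = math_toy.merge(ds_toy, on="ID")
print(list(on_id.columns))

# Merging on every shared column keeps a single Sex column.
# validate="one_to_one" raises an error if a key is duplicated in either frame.
on_shared = math_toy.merge(ds_toy, on=["ID", "Sex"], validate="one_to_one")
print(on_shared)
```

In this toy example, students 3 and 4 each appear in only one frame, so the inner join drops them. Passing `how="outer"` would keep them, with missing scores as NaN.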

In [10]:
# Step 5 - Change Sex(M/F) Column to 1/2 for further analysis

if "Sex" in merged_df.columns:
    merged_df["Sex"] = merged_df["Sex"].map({"M": 1, "F": 2})
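
`Series.map` with a dict turns any value that is not a key of the dict into NaN, so a stray `"m"` or a blank entry would be lost without warning. A minimal sketch of how to check for that, using a made-up Series rather than the real Sex column:

```python
import pandas as pd

# Made-up values, including a lowercase entry and a missing one.
sex = pd.Series(["M", "F", "m", None])

# Values that are not keys of the mapping become NaN.
coded = sex.map({"M": 1, "F": 2})
print(coded.isna().sum())  # number of entries that were not mapped

# Normalizing case before mapping recovers the lowercase entry.
coded_clean = sex.str.upper().map({"M": 1, "F": 2})
print(coded_clean)
```

Counting the NaN values before and after the mapping is a quick way to confirm that no category was missed.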

In [11]:
# Step 6 - Store the data in a new file – ScoreFinal.csv

merged_df.to_csv("ScoreFinal.csv", index=False)

In [12]:
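# Enhancement 1: Convert Ethnicity to numerical values
# Note: Step 2 drops the Ethnicity column, so run this cell on freshly loaded data
# (keeping Ethnicity in Step 2); otherwise the check below finds nothing to convert.
# Example mapping: Asian=1, Hispanic=2, White=3, Black=4, Other=5
# (assumed labels; any value not in the mapping becomes NaN)
ethnicity_map = {"Asian": 1, "Hispanic": 2, "White": 3, "Black": 4, "Other": 5}
for df in [math_df, ds_df, physics_df]:
    if "Ethnicity" in df.columns:
        df["Ethnicity"] = df["Ethnicity"].map(ethnicity_map)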
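
When the exact ethnicity labels in the files are not known in advance, a hand-written mapping can miss categories. As an alternative technique (not the mapping approach used in the cell above), `pd.factorize` assigns an integer code to each distinct value it encounters. The Series below is made up for illustration.

```python
import pandas as pd

# Made-up labels, including one missing entry.
ethnicity = pd.Series(["Asian", "White", "Asian", None])

# factorize numbers the distinct values in order of first appearance;
# missing values receive the code -1 by default.
codes, categories = pd.factorize(ethnicity)
print(codes)
print(list(categories))
```

The codes start at 0 and depend on the order in which values appear, so the same mapping should be reused across all three files rather than computed separately for each.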


In [13]:
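# Enhancement 2: Fill a student's missing score with the class average
# Note: Step 3 already replaced missing scores with 0, so no NaN values remain at
# this point; skip Step 3 for this cell to have an effect.

for col in ["MathScore", "DSScore", "PhysicsScore"]:
    if col in merged_df.columns:
        class_avg = merged_df[col].mean()  # mean() ignores NaN by default
        # Assign the result back instead of using inplace=True on a column
        # selection, which may not update merged_df in recent pandas versions.
        merged_df[col] = merged_df[col].fillna(class_avg)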
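
The class-average fill only works on scores that are still missing: if they were filled with zero first, the zeros would both hide the gaps and pull the average down. A small sketch on made-up scores, assuming the column names `MathScore` and `DSScore`:

```python
import pandas as pd

# Made-up scores with one gap per subject (column names are assumed).
scores = pd.DataFrame({
    "MathScore": [60.0, None, 80.0],
    "DSScore": [None, 70.0, 90.0],
})
score_cols = ["MathScore", "DSScore"]

# DataFrame.mean() skips NaN by default, so each average uses only recorded scores.
class_avg = scores[score_cols].mean()

# Passing the Series of averages fills each column with its own mean.
scores[score_cols] = scores[score_cols].fillna(class_avg)
print(scores)
```

Here the Math gap is filled with 70.0 and the DS gap with 80.0, the averages of the two recorded scores in each column.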
