# Innopolis University
# Summer Semester 2025
# Programming in Python 
## Assignment 2: Data Analysis with NumPy and Pandas (20 points)

**Due Date**: 20.07.2025 23:59

---

### Overview
In this assignment, you will practice using **Pandas** and **NumPy** to perform data loading, manipulation, and linear algebra operations such as computing eigenvalues and eigenvectors. You will work step-by-step in a Jupyter Notebook, writing functions, validating results with `assert` statements, and producing simple visualizations.

**Learning Objectives:**
- Load and inspect CSV data with Pandas.
- Manipulate DataFrames: handling missing values, normalization, and basic statistics.
- Compute covariance and correlation matrices.
- Calculate eigenvalues and eigenvectors with NumPy.
- Sort and select principal components.
- Verify mathematical properties with assertions.

---

## Preliminary: Provided Student Dataset
Below is an excerpt of `students.csv`, which you will use throughout this assignment. It contains student records, with some missing values in `GPA`, `Hobby`, `Scholarship`, and `Email`.

```
Name,Surname,Age,GPA,Course,Hobby,Scholarship,Have Retakes,Email
Nyra,Xanthe,18,3.74,Software Engineering,Writing,No,No,nyra.xanthe@innopolis.university
Doran,Mavren,18,2.38,Robotics,Photography,Yes,Yes,doran.mavren@innopolis.university
Aylen,Dorin,22,4.78,Computer Vision,Hiking,No,No,aylen.dorin@innopolis.university
Kael,Halden,21,4.18,Robotics,Photography,No,No,kael.halden@innopolis.university
Varek,Halden,18,3.74,AI Engineering,Writing,No,No,varek.halden@innopolis.university
Kael,Senn,18,2.53,Mathematics,Coding,No,Yes,kael.senn@innopolis.university
Casen,Orlin,18,4.16,Data Science,Baking,Yes,No,casen.orlin@innopolis.university
Aylen,Kaelen,23,2.5,Robotics,Gymnastics,No,Yes,aylen.kaelen@innopolis.university
Elira,Dorin,18,2.78,Computer Science,Rock Climbing,No,Yes,elira.dorin@innopolis.university
Riven,Drelor,25,2.43,Computer Science,Writing,Yes,Yes,riven.drelor@innopolis.university
Bram,Ravick,19,2.64,Software Engineering,Writing,Yes,Yes,bram.ravick@innopolis.university
Mira,Ravick,21,2.47,Mathematics,Piano,Yes,Yes,mira.ravick@innopolis.university
Jalen,Voss,21,,Mathematics,Writing,No,No,jalen.voss@innopolis.university
... (remaining rows as provided) ...
```

The file is named `students.csv`.


## Step 1: Load and Inspect Data (2 points)
1. Import `pandas` and `numpy`.
2. Read `students.csv` into a DataFrame `df`.
3. Display the first 5 rows and a summary of missing values.

In [1346]:
# Karim Khabibrakhmanov, DSAI-05, G3, k.khabibrakhmanov@innopolis.university

# 1. Importing libraries
import pandas as pd
import numpy as np

# 2. Reading students.csv into df
df = pd.read_csv('students.csv')

# 3. Displaying the first 5 rows and a summary of missing values
print(df.head(5))
print(df.isna().sum())

    Name Surname  Age   GPA                Course        Hobby Scholarship  \
0   Nyra  Xanthe   18  3.74  Software Engineering      Writing          No   
1  Doran  Mavren   18  2.38              Robotics  Photography         Yes   
2  Aylen   Dorin   22  4.78       Computer Vision       Hiking          No   
3   Kael  Halden   21  4.18              Robotics  Photography          No   
4  Varek  Halden   18  3.74        AI Engineering      Writing          No   

  Have Retakes                              Email  
0           No   nyra.xanthe@innopolis.university  
1          Yes  doran.mavren@innopolis.university  
2           No   aylen.dorin@innopolis.university  
3           No   kael.halden@innopolis.university  
4           No  varek.halden@innopolis.university  
Name            0
Surname         0
Age             0
GPA             3
Course          0
Hobby           4
Scholarship     4
Have Retakes    0
Email           4
dtype: int64


In [1347]:
assert isinstance(df, pd.DataFrame)
assert df.shape[0] >= 10 and df.shape[1] == 9
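
Besides absolute counts, the share of missing values per column can help when inspecting a dataset. The following is a minimal sketch, assuming three rows copied from the excerpt above are parsed from an in-memory string; the names `sample_csv`, `sample`, and `missing_share` are illustrative and not part of the assignment.

```python
import io

import pandas as pd

# Three rows copied from the students.csv excerpt above (Jalen has no GPA).
sample_csv = """Name,Surname,Age,GPA,Course,Hobby,Scholarship,Have Retakes,Email
Nyra,Xanthe,18,3.74,Software Engineering,Writing,No,No,nyra.xanthe@innopolis.university
Mira,Ravick,21,2.47,Mathematics,Piano,Yes,Yes,mira.ravick@innopolis.university
Jalen,Voss,21,,Mathematics,Writing,No,No,jalen.voss@innopolis.university
"""

sample = pd.read_csv(io.StringIO(sample_csv))

# Fraction of missing entries per column (0.0 means the column is complete).
missing_share = sample.isna().mean()

# Show only the columns that actually have gaps.
print(missing_share[missing_share > 0])
```

On the full file, the same two lines applied to `df` would give the proportions that correspond to the counts printed by `df.isna().sum()`.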

## Step 2: Drop Students without Email (2 points)

Remove any rows where the `Email` field is missing or blank.

In [1348]:
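# Treating blank (empty or whitespace-only) emails as missing, then dropping those rows
df['Email'] = df['Email'].replace(r'^\s*$', np.nan, regex=True)
df.dropna(subset=['Email'], inplace=True)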

In [1349]:
assert df['Email'].isnull().sum() == 0

## Step 3: Impute Scholarship Based on GPA (2 points)

For any missing `Scholarship` entry, set the value to `'Yes'` if `GPA > 4.5`, otherwise `'No'`.

In [1350]:
# Returns 'Yes' if GPA > 4.5, otherwise 'No', when Scholarship is missing; existing values are kept
def impute_scholarship(row):
    if pd.isnull(row['Scholarship']):
        return 'Yes' if row['GPA'] > 4.5 else 'No'
    return row['Scholarship']

# Applying impute_scholarship to the DataFrame row by row
df['Scholarship'] = df.apply(impute_scholarship, axis=1)

In [1351]:
assert df['Scholarship'].isnull().sum() == 0
unique_vals = set(df['Scholarship'].unique())
assert unique_vals <= {'Yes','No'}
assert df['Scholarship'].value_counts()["Yes"] == 47
assert df['Scholarship'].value_counts()["No"] == 49
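
The same rule can also be expressed without a row-wise `apply`. The sketch below is one possible vectorized variant, shown on a small hypothetical frame (`toy` and its values are made up for illustration): the rule is evaluated for every row with `np.where`, and `fillna` then uses it only where `Scholarship` is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "GPA": [4.8, 3.1, 4.6, 2.9],
    "Scholarship": [None, None, "No", "Yes"],
})

# Rule from Step 3, evaluated for every row at once.
rule = pd.Series(np.where(toy["GPA"] > 4.5, "Yes", "No"), index=toy.index)

# fillna only touches the missing entries; existing labels are kept,
# so the third row stays "No" even though its GPA is above 4.5.
toy["Scholarship"] = toy["Scholarship"].fillna(rule)
print(toy)
```

A missing GPA compares as `False` against 4.5, so such a row would receive `'No'` in both this variant and the row-wise function.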


## Step 4: Impute Missing GPA with Course Average (2 points)

Replace any missing `GPA` with the average GPA for that student's course.

In [1352]:
# Computing the average GPA for each course (missing values are skipped)
avg_per_course = df.groupby('Course')['GPA'].mean()

# Replaces a missing GPA with the average GPA of the student's course
def impute_gpa(row):
    if pd.isnull(row['GPA']):
        return avg_per_course[row['Course']]
    return row['GPA']

In [1353]:
# Applying impute_gpa to the DataFrame
df['GPA'] = df.apply(impute_gpa, axis=1)

In [1354]:
assert df['GPA'].isnull().sum() == 0
assert abs(df['GPA'].mean()-3.4687) < 0.001
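
Group-wise imputation of this kind can also be written with `groupby(...).transform`, which returns one value per original row. This is a minimal sketch on a hypothetical frame (`toy` and `course_mean` are illustrative names), not the method required by the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per course.
toy = pd.DataFrame({
    "Course": ["Robotics", "Robotics", "Robotics", "Mathematics", "Mathematics"],
    "GPA": [2.0, 4.0, np.nan, 3.0, np.nan],
})

# transform("mean") returns a Series aligned with the original rows,
# holding each row's course mean (NaN values are skipped by default).
course_mean = toy.groupby("Course")["GPA"].transform("mean")

# Fill each gap with the mean of the matching course.
toy["GPA"] = toy["GPA"].fillna(course_mean)
print(toy)
```

Because `course_mean` shares the index of `toy`, `fillna` can align the two without an explicit lookup per row.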

## Step 5: Compute Average GPA per Course (2 points)

Calculate and display the average GPA for each `Course`.


In [1355]:
# The averages were already computed in Step 4 (avg_per_course)
# Displaying the average GPA for each Course
print(avg_per_course)

Course
AI Engineering          3.640000
Computer Science        3.056667
Computer Vision         3.786667
Cybersecurity           3.462500
Data Science            3.911250
Information Security    2.975000
Mathematics             3.506364
Robotics                3.260909
Software Engineering    3.706667
Name: GPA, dtype: float64


In [1356]:
assert isinstance(avg_per_course, pd.Series)
assert set(avg_per_course.index) == set(df['Course'].unique())
assert abs(avg_per_course.sum()-31.306022)<0.0001
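
Reusing averages computed before the imputation is safe because filling gaps with a group's own mean leaves that mean unchanged. The sketch below checks this on a hypothetical frame (`toy`, `before`, and `after` are illustrative names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing GPA per course.
toy = pd.DataFrame({
    "Course": ["Robotics", "Robotics", "Robotics", "Mathematics", "Mathematics"],
    "GPA": [2.5, 4.0, np.nan, 3.0, np.nan],
})

# Per-course means before imputation (NaN values are skipped).
before = toy.groupby("Course")["GPA"].mean()

# Impute each gap with the mean of its course, then recompute the means.
toy["GPA"] = toy["GPA"].fillna(toy["Course"].map(before))
after = toy.groupby("Course")["GPA"].mean()

# Filling j gaps with the mean m of k observed values gives
# (k*m + j*m) / (k + j) = m, so the per-course means do not move.
print(np.allclose(before, after))
```

The overall mean of the column can still change, since courses with more gaps gain more weight after imputation.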

## Step 6: Drop Non-Categorical, Non-Numerical Columns (2 points)

Remove the columns that are neither numeric nor categorical features for the analysis (`Name`, `Surname`, `Email`, `Hobby`).


In [1357]:
# Columns that are neither categorical nor numerical (`Name`, `Surname`, `Email`, `Hobby`)
cols_to_drop = ['Name', 'Surname', 'Email', 'Hobby']

# Keeping only the columns still present, so that re-running this cell does not raise
# KeyError: "['Name', 'Surname', 'Email', 'Hobby'] not found in axis"
found_in_axis_columns = [col for col in cols_to_drop if col in df.columns]

# Dropping these columns
df.drop(columns=found_in_axis_columns, inplace=True)

In [1358]:
for col in cols_to_drop:
    assert col not in df.columns
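
An alternative to filtering the column list by hand is the `errors="ignore"` option of `DataFrame.drop`, which skips labels that are not present. A minimal sketch on a hypothetical one-row frame (`toy` is illustrative):

```python
import pandas as pd

# Hypothetical frame that contains only one of the four columns to drop.
toy = pd.DataFrame({"Name": ["Nyra"], "Age": [18], "GPA": [3.74]})
cols_to_drop = ["Name", "Surname", "Email", "Hobby"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
toy = toy.drop(columns=cols_to_drop, errors="ignore")

# Running the same line again is harmless, which suits re-executed notebook cells.
toy = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(toy.columns))
```

The trade-off is that a misspelled column name is silently ignored as well, so the assertion loop above remains a useful check.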

## Step 7: Encode Categorical Variables (2 points)

Convert all categorical columns (`Course`, `Scholarship`, `Have Retakes`) to numeric codes starting from 1.


In [1359]:
# Categorical columns to encode (`Course`, `Scholarship`, `Have Retakes`)
cat_cols = ['Course', 'Scholarship', 'Have Retakes']

# Solution adapted from an answer by the user shantanu pathak
# Source: https://arc.net/l/quote/rapkvnlm

# Converting these columns to the categorical dtype
df[cat_cols] = df[cat_cols].astype('category')

# Replacing each category with its integer code, shifted to start from 1
for col in cat_cols:
    df[col] = df[col].cat.codes + 1

# Displaying the dtypes of the encoded columns
print(df[cat_cols].dtypes)

Course          int8
Scholarship     int8
Have Retakes    int8
dtype: object


In [1360]:
for col in cat_cols:
    assert pd.api.types.is_integer_dtype(df[col])
    assert df[col].min() == 1
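
To see what `cat.codes` produces, the sketch below encodes a short hypothetical Series (`course` and `codes` are illustrative names). It shows that categories inferred from strings are sorted, and that a missing value gets the code -1, which becomes 0 after the shift by one.

```python
import pandas as pd

# Hypothetical column with one missing entry.
course = pd.Series(["Robotics", "Mathematics", None, "Robotics"], dtype="category")

# Categories inferred from string data are stored in sorted order.
print(list(course.cat.categories))

# Codes are 0-based positions in that list; a missing value is coded as -1,
# so after adding 1 it shows up as 0 rather than as a valid category.
codes = course.cat.codes + 1
print(codes.tolist())
```

This is why the check `df[col].min() == 1` also confirms that no missing values were left in the encoded columns.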


## Step 8: Normalize All Features (2 points)

Normalize the numeric DataFrame so each column has zero mean and unit variance.

In [1361]:
# Numeric columns (`Age`, `GPA`)
num_cols = ['Age', 'GPA']

# Converting these columns to float
df[num_cols] = df[num_cols].astype('float')

# Z-score normalization using the population standard deviation (ddof=0)
# Source: https://arc.net/l/quote/szujbnhv
df_norm = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)

In [1362]:
assert np.allclose(df_norm.mean(), 0, atol=1e-8)
assert np.allclose(df_norm.std(ddof=0), 1, atol=1e-8)
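
The choice of `ddof` matters for the unit-variance check. pandas computes the sample standard deviation (`ddof=1`) by default, so dividing by it does not give a population standard deviation of exactly 1. A small sketch using the first five `Age` values from the sample (`ages`, `z_pop`, and `z_sample` are illustrative names):

```python
import numpy as np
import pandas as pd

# First five Age values from the students.csv excerpt.
ages = pd.Series([18.0, 18.0, 22.0, 21.0, 18.0])
n = len(ages)

# Standardize with the population std (ddof=0) and with the pandas default (ddof=1).
z_pop = (ages - ages.mean()) / ages.std(ddof=0)
z_sample = (ages - ages.mean()) / ages.std()

print(z_pop.std(ddof=0))     # 1 up to rounding
print(z_sample.std(ddof=0))  # sqrt((n - 1) / n), slightly below 1
```

The difference shrinks as the number of rows grows, but for the assertion above the two `ddof` values have to match.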

## Step 9: Compute Covariance Matrix (2 points)

Compute the covariance matrix of the normalized data.

In [1363]:
# Background on the covariance matrix
# Source: https://www.geeksforgeeks.org/maths/covariance-matrix/

# Documentation for the built-in DataFrame.cov method
# Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html
cov_matrix = df_norm.cov()

In [1364]:
d = df_norm.shape[1]
assert cov_matrix.shape == (d, d)
assert np.allclose(cov_matrix, cov_matrix.T, atol=1e-8)
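
For standardized data, the covariance matrix is closely tied to the correlation matrix of the original data. The sketch below illustrates this on the first five `Age`/`GPA` pairs from the sample (`toy`, `z`, `cov_default`, and `cov_population` are illustrative names). With `ddof=0` the two matrices coincide; with the default `ddof=1` of `DataFrame.cov` they differ by the factor n / (n - 1).

```python
import numpy as np
import pandas as pd

# First five Age/GPA pairs from the students.csv excerpt.
toy = pd.DataFrame(
    {"Age": [18, 18, 22, 21, 18], "GPA": [3.74, 2.38, 4.78, 4.18, 3.74]},
    dtype=float,
)
n = len(toy)

# Z-score normalization with the population standard deviation.
z = (toy - toy.mean()) / toy.std(ddof=0)

cov_default = z.cov()            # divides by n - 1 (pandas default)
cov_population = z.cov(ddof=0)   # divides by n

# Population covariance of z-scores equals the Pearson correlation matrix.
print(np.allclose(cov_population, toy.corr()))

# The default estimate is the same matrix scaled by n / (n - 1).
print(np.allclose(cov_default, toy.corr() * n / (n - 1)))
```

The scaling is a single constant, so it changes the eigenvalues by that factor but leaves the eigenvectors as they are.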

## Step 10: Eigenvalues and Eigenvectors (2 points)

Compute eigenvalues and eigenvectors of the covariance matrix.


In [1365]:
# Reference for the built-in function that computes eigenvalues and eigenvectors
# Source: https://pythonnumericalmethods.studentorg.berkeley.edu/notebooks/chapter15.04-Eigenvalues-and-Eigenvectors-in-Python.html
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

In [1366]:
assert eig_vals.shape[0] == d
assert eig_vecs.shape == (d, d)

In [1367]:
idx = np.argmax(eig_vals)
v = eig_vecs[:, idx]
assert np.allclose(cov_matrix.dot(v), eig_vals[idx] * v, atol=1e-6)
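
The learning objectives also mention sorting and selecting principal components. One way this could look is sketched below on a hypothetical 2x2 symmetric matrix (`cov`, `order`, `explained`, and `components` are illustrative names). It uses `np.linalg.eigh`, which is intended for symmetric matrices, instead of the `np.linalg.eig` call used above.

```python
import numpy as np

# Hypothetical symmetric covariance matrix, chosen so the result is easy to check.
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh returns real eigenvalues in ascending order,
# with the matching eigenvectors as columns.
vals, vecs = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Share of the total variance carried by each component.
explained = vals / vals.sum()
print(vals)       # approximately 3 and 1
print(explained)  # approximately 0.75 and 0.25

# Keep the top k components as columns.
k = 1
components = vecs[:, :k]
```

Projecting the normalized data onto `components` (a matrix product of the data with these columns) would then give the coordinates along the selected principal components.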