# Data Cleaning

In this file, we are going to clean our data and save the created dataframes for future analysis.

Although we tried to reduce the amount of cleaning required through good survey design, some cases still require cleaning.

## Import packages

In [195]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import utils
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from fuzzywuzzy import fuzz

## Inspect data

Load data

In [196]:
data_filepath = "../data/strike_and_academic_performance.csv"

data = pd.read_csv(data_filepath)

In [197]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 431 entries, 0 to 430
Data columns (total 21 columns):
 #   Column                                                                              Non-Null Count  Dtype  
---  ------                                                                              --------------  -----  
 0   Are you a student of the University of Lagos?                                       431 non-null    object 
 1   If not, what is your university/institution?                                        22 non-null     object 
 2   What is your current academic level?                                                431 non-null    object 
 3   How old are you?                                                                    431 non-null    int64  
 4   What is your gender?                                                                431 non-null    object 
 5   What was your relationship status during the strike?                                431 non-null   

Are there any missing values?

In [198]:
data.isna().sum()

Are you a student of the University of Lagos?                                           0
If not, what is your university/institution?                                          409
What is your current academic level?                                                    0
How old are you?                                                                        0
What is your gender?                                                                    0
What was your relationship status during the strike?                                    0
What is your faculty?                                                                   0
What is your department?                                                                0
Kindly input your department, if not listed in the previous question.                 403
How has the ASUU strike affected you and your academic performance?                    86
What was the most challenging part of returning to academic life after the strike?     91
Did you un

### Rename Columns
The current column names are too long. Let's make them shorter for easier analysis.

In [199]:
# Rename multiple columns

column_mapping = {
    'Are you a student of the University of Lagos?': 'unilag',
    'If not, what is your university/institution?': 'non_unilag',
    'What is your current academic level?': 'level',
    'How old are you?': 'age',
    'What is your gender?': 'gender',
    'What was your relationship status during the strike?': 'relationship',
    'What is your faculty?': 'faculty',
    'What is your department?': 'department',
    'Kindly input your department, if not listed in the previous question. ': 'other_dept',
    'How has the ASUU strike affected you and your academic performance?': 'strike_effect',
    'What was the most challenging part of returning to academic life after the strike?': 'challenge',
    'Did you undertake any work during the strike?': 'work',
    'How did you develop yourself during the strike?': 'skills',
    'How prepared were you for the exams? [Before Strike]': 'prep_before',
    'How prepared were you for the exams? [After Strike]': 'prep_after',
    'How were your lectures affected by the strike?': 'lecture',
    'How often did you engage in academic activities during the strike?': 'academic_act',
    'How many courses did you take in the affected semester? ': 'courses_taken',
    'How many credit units did your courses add up to in the affected semester?': 'course_unit',
    'What was your CGPA before the strike?': 'cgpa_before',
    'What is your current CGPA?': 'cgpa_after'
}

data = data.rename(columns=column_mapping)

data.head(2)

Unnamed: 0,unilag,non_unilag,level,age,gender,relationship,faculty,department,other_dept,strike_effect,...,work,skills,prep_before,prep_after,lecture,academic_act,courses_taken,course_unit,cgpa_before,cgpa_after
0,Yes,,400 Level,22,Male,Single,Engineering,Chemical Engineering,,I learned how to study better and my grades al...,...,Worked in a role relevant to my studies,Acquired skills unrelated to course of study,Poorly,Poorly,No noticeable change,Rarely: I engaged in academic activities once ...,10,23.0,3.39,3.51
1,Yes,,400 Level,23,Female,Single,Engineering,Chemical Engineering,,It affected it in a negative way as it became ...,...,Did not work during the strike,Acquired skills unrelated to course of study,Poorly,Moderately,No noticeable change,Rarely: I engaged in academic activities once ...,10,23.0,4.44,4.5
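
`rename` silently skips mapping keys that do not match an existing column, for example a key that lacks a trailing space present in the survey header. A minimal sketch of checking the mapping first, on a toy frame; `toy` and `unmatched` are illustrative names, not part of this notebook:

```python
import pandas as pd

# Toy frame standing in for the survey data; the second header ends with a space
toy = pd.DataFrame({
    "How old are you?": [22, 23],
    "How many courses did you take in the affected semester? ": [10, 10],
})

mapping = {
    "How old are you?": "age",
    # Trailing space omitted on purpose: rename() would skip this key without raising
    "How many courses did you take in the affected semester?": "courses_taken",
}

# Keys that do not correspond to any column
unmatched = [key for key in mapping if key not in toy.columns]
print(unmatched)

# errors="raise" turns a silent miss into a KeyError
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

An empty `unmatched` list means every key in the mapping was applied.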


### Creating our target column
Our target column is the change in CGPA after the strike. We want to see whether the strike coincided with a positive or negative change for the participants.

If a participant's CGPA increased after the strike, the strike may have had a positive effect on them; if it decreased, the effect may have been negative.

In [200]:
#Creating our outcome variable column

data['cgpa_change'] = data['cgpa_after'] - data['cgpa_before']
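
Respondents without a CGPA entered 0 (this is addressed later in the notebook), so a plain subtraction treats 0 as a real score. The sketch below uses made-up rows and a hypothetical `cgpa_change_masked` column to show how masking the zeros first keeps those rows from producing spurious changes:

```python
import numpy as np
import pandas as pd

# Made-up rows: the second has no CGPA before, the third has no CGPA after
toy = pd.DataFrame({
    "cgpa_before": [3.39, 0.0, 4.44],
    "cgpa_after": [3.51, 4.28, 0.0],
})

# Plain subtraction: the 0 placeholders look like large gains or losses
toy["cgpa_change"] = toy["cgpa_after"] - toy["cgpa_before"]

# Treat 0 as "not available" before subtracting
masked = toy[["cgpa_before", "cgpa_after"]].replace(0, np.nan)
toy["cgpa_change_masked"] = masked["cgpa_after"] - masked["cgpa_before"]

print(toy.round(2))
```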

## Solving the department debacle

There are two department columns, `department` and `other_dept`:
1. `department` holds the department chosen from the survey list; respondents who could not find theirs selected "Other" and typed it into `other_dept`.
2. We have to merge them, since only one department column is needed.

In [201]:
data.department.unique()

array(['Chemical Engineering', 'Political Science',
       'Computer Engineering', 'Educational Foundations', 'Statistics',
       'Geosciences', 'Science Tech. Education',
       'Petroleum & Gas Engineering', 'Cell Biology & Genetics',
       'Surveying & Geo-Informatics Engineering', 'Mathematics',
       'Finance', 'Marine Science',
       'Industrial Relations & Personnel Management',
       'Mechanical Engineering', 'Mass Communication',
       'Biomedical Engineering', 'Estate', 'Other',
       'Biochemistry (Basic Medical Sciences)', 'Law', 'Medicine',
       'Arts & Social Science Education', 'Zoology',
       'Biochemistry (Sciences)', 'Education Administration', 'Botany',
       'Economics', 'Systems Engineering', 'Psychology', 'Accounting',
       'Physics', 'Radiology', 'Electrical & Electronics Engineering',
       'Geography', 'Microbiology', 'Chemistry', 'Architecture',
       'Biology Education', 'Human Kinetics & Health Education',
       'Physiology', 'Adult Educatio

In [202]:
data.other_dept.unique()

array([nan, 'Geophysics ', 'Radiography ', 'Biology Education ',
       'FISHERIES ', 'Education and Biology ', 'Pharmacy ', 'Law ',
       'Early childhood education ', 'Education Eng',
       'Business Education ', 'Education ', 'Business Education',
       'Religious Studies ', 'Technology and vocational education ',
       'Communication and Language Arts ', 'Pharmacology ',
       'Pharmacology, therapeutics and toxicology ', 'PHARMACY ',
       'Pharmacy', 'Pharmacology', 'Mechatronics Engineering.',
       'Banking and Finance ', 'Insurance ', 'Education foundation ',
       'Art & Social Science Education '], dtype=object)

Replace "Other" in `department` with the corresponding value from `other_dept`

In [203]:
data['department'] = data['department'].str.lower()
data['other_dept'] = data['other_dept'].str.lower()


data['department'] = data.apply(lambda row: row['other_dept'] if row['department'] == 'other' else row['department'], axis=1).str.capitalize()
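
The same merge can be written without a row-wise `apply`. The sketch below is an alternative on toy data, not the notebook's code. It also strips surrounding whitespace, which the unique values above show in entries such as `'PHARMACY '`, so that otherwise identical departments collapse into one label:

```python
import pandas as pd

toy = pd.DataFrame({
    "department": ["Chemical Engineering", "Other", "Other", "Law"],
    "other_dept": [None, "PHARMACY ", "Pharmacy", None],
})

# Normalise case and whitespace in both columns
dept = toy["department"].str.strip().str.lower()
other = toy["other_dept"].str.strip().str.lower()

# Where department is "other", take the free-text answer instead
merged = dept.mask(dept == "other", other).str.capitalize()
print(merged.tolist())
```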


In [204]:
#Deal with any missing data in this column
data[data['department'].isna()]

Unnamed: 0,unilag,non_unilag,level,age,gender,relationship,faculty,department,other_dept,strike_effect,...,skills,prep_before,prep_after,lecture,academic_act,courses_taken,course_unit,cgpa_before,cgpa_after,cgpa_change
51,Yes,,200 Level,20,Male,Single,Pharmacy,,,,...,Acquired skills unrelated to course of study,Moderately,Moderately,Fewer lecturers attended classes,Often: I engaged in academic activities regula...,7,,5.0,4.89,-0.11
331,Yes,,300 Level,22,Female,Single,Social Sciences,,,I am mentally tired,...,"Volunteered for an event or organization, Acqu...",Moderately,Poorly,No noticeable change,Rarely: I engaged in academic activities once ...,7,18.0,0.0,0.0,0.0


There are two missing values here. We know that "Pharmacy" is the most common value in `other_dept` for Pharmacy students. Hence, we can replace the NaN at index 51 with it.

In [205]:
#For location 51
data.loc[51, 'department'] = 'Pharmacy'

However, we have no context for the row at index 331, so we have to drop it.

In [206]:
# Drop row with index 331 and reindex the DataFrame
data.drop(331, inplace=True)
data.reset_index(drop=True, inplace=True)

How many rows do we have in the data now?

In [207]:
len(data['department'])

430

Now let's drop the other_dept column.

In [208]:
data.drop(columns=['other_dept'], inplace=True)
data.head(2)

Unnamed: 0,unilag,non_unilag,level,age,gender,relationship,faculty,department,strike_effect,challenge,...,skills,prep_before,prep_after,lecture,academic_act,courses_taken,course_unit,cgpa_before,cgpa_after,cgpa_change
0,Yes,,400 Level,22,Male,Single,Engineering,Chemical engineering,I learned how to study better and my grades al...,Trying to remember things we were taught befor...,...,Acquired skills unrelated to course of study,Poorly,Poorly,No noticeable change,Rarely: I engaged in academic activities once ...,10,23.0,3.39,3.51,0.12
1,Yes,,400 Level,23,Female,Single,Engineering,Chemical engineering,It affected it in a negative way as it became ...,"Rekindling the student in me, lol. Trying to g...",...,Acquired skills unrelated to course of study,Poorly,Moderately,No noticeable change,Rarely: I engaged in academic activities once ...,10,23.0,4.44,4.5,0.06


## Are the columns in the right data types?

In [209]:
data = data.convert_dtypes()

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   unilag         430 non-null    string 
 1   non_unilag     22 non-null     string 
 2   level          430 non-null    string 
 3   age            430 non-null    Int64  
 4   gender         430 non-null    string 
 5   relationship   430 non-null    string 
 6   faculty        430 non-null    string 
 7   department     430 non-null    string 
 8   strike_effect  344 non-null    string 
 9   challenge      339 non-null    string 
 10  work           430 non-null    string 
 11  skills         430 non-null    string 
 12  prep_before    430 non-null    string 
 13  prep_after     430 non-null    string 
 14  lecture        430 non-null    string 
 15  academic_act   430 non-null    string 
 16  courses_taken  430 non-null    Int64  
 17  course_unit    352 non-null    Float64
 18  cgpa_befor

Let's make some further conversions. 

In [210]:
# convert some columns to category
cat_columns = [
    "level", "gender", "relationship", "faculty", "department", "work","prep_before", "prep_after", "lecture", "academic_act", 
    ]

data[cat_columns] = data[cat_columns].astype('category')
    
# convert course_unit to integer
data["course_unit"] = data["course_unit"].astype('Int64')
    
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   unilag         430 non-null    string  
 1   non_unilag     22 non-null     string  
 2   level          430 non-null    category
 3   age            430 non-null    Int64   
 4   gender         430 non-null    category
 5   relationship   430 non-null    category
 6   faculty        430 non-null    category
 7   department     430 non-null    category
 8   strike_effect  344 non-null    string  
 9   challenge      339 non-null    string  
 10  work           430 non-null    category
 11  skills         430 non-null    string  
 12  prep_before    430 non-null    category
 13  prep_after     430 non-null    category
 14  lecture        430 non-null    category
 15  academic_act   430 non-null    category
 16  courses_taken  430 non-null    Int64   
 17  course_unit    352 non-null    Int6
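
Some of these columns are ordinal. If their order matters in later analysis, an ordered categorical preserves it. The category list below is an assumption inferred from the values visible in the sample rows; the actual survey options may include more levels:

```python
import pandas as pd

# Assumed ordering, inferred from values visible in the sample rows
prep_levels = ["Poorly", "Moderately", "Very"]
prep_dtype = pd.CategoricalDtype(categories=prep_levels, ordered=True)

prep = pd.Series(["Poorly", "Very", "Moderately"]).astype(prep_dtype)

# Sorting and comparisons now follow the declared order, not the alphabet
sorted_prep = prep.sort_values().tolist()
at_least_moderate = (prep >= "Moderately").tolist()
print(sorted_prep)
print(at_least_moderate)
```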

## Unifying the schools

Similar to the department debacle, we have two columns to merge. 
1. `unilag`: Contains a Yes/No option for students of the University of Lagos. 
2. `non_unilag`: Students not from the University of Lagos had to include their schools.

Which schools are represented in this dataset?

In [211]:
data.non_unilag.unique()

<StringArray>
[                                                                       <NA>,
                                                                       'Nil',
                                                                   'UNILAG ',
             'Federal University of Petroleum Resources Effurun Delta State',
                                                           'OOU OGUN STATE ',
                                                     'University of Ibadan ',
 'Alex Ekwueme Federal University Ndufu-Alike Ikwo, Abakaliki Ebonyi State ',
                                                      'University of ibadan',
                                                    'Lagos state University',
                                                                    'Funaab',
                                                        'University of Uyo ',
                                                       'University of Abuja',
                                                  

### Creating the school column.

Let's create a column called `school` by renaming `non_unilag`. Where `school` is NaN, we replace the NaN with "university of lagos". Then we drop the `unilag` column.

In [212]:
# rename non_unilag column to school
data.rename(columns={"non_unilag":"school"}, inplace=True)

# replace NaN with "university of lagos"
data["school"] = data["school"].fillna("university of lagos")

# lowercase the values and strip surrounding whitespace
data["school"] = data["school"].str.lower().str.strip()

# drop the "unilag" column
data.drop(columns = "unilag", inplace=True)

data.head()

Unnamed: 0,school,level,age,gender,relationship,faculty,department,strike_effect,challenge,work,skills,prep_before,prep_after,lecture,academic_act,courses_taken,course_unit,cgpa_before,cgpa_after,cgpa_change
0,university of lagos,400 Level,22,Male,Single,Engineering,Chemical engineering,I learned how to study better and my grades al...,Trying to remember things we were taught befor...,Worked in a role relevant to my studies,Acquired skills unrelated to course of study,Poorly,Poorly,No noticeable change,Rarely: I engaged in academic activities once ...,10,23.0,3.39,3.51,0.12
1,university of lagos,400 Level,23,Female,Single,Engineering,Chemical engineering,It affected it in a negative way as it became ...,"Rekindling the student in me, lol. Trying to g...",Did not work during the strike,Acquired skills unrelated to course of study,Poorly,Moderately,No noticeable change,Rarely: I engaged in academic activities once ...,10,23.0,4.44,4.5,0.06
2,nil,400 Level,21,Male,Dating,Engineering,Chemical engineering,It has actually helped me a bit. The extended ...,Readapting to school,Worked in a role unrelated to my studies,"Volunteered for an event or organization, Acqu...",Moderately,Moderately,Fewer lecturers attended classes,Rarely: I engaged in academic activities once ...,10,23.0,3.54,3.61,0.07
3,university of lagos,400 Level,29,Male,Dating,Social Sciences,Political science,Good,Reading,Worked in a role unrelated to my studies,Acquired skills unrelated to course of study,Moderately,Very,No noticeable change,Rarely: I engaged in academic activities once ...,7,,3.86,3.96,0.1
4,university of lagos,100 Level,18,Female,Single,Engineering,Computer engineering,,,Worked in a role relevant to my studies,"Volunteered for an event or organization, Acqu...",Moderately,Moderately,Fewer lecturers attended classes,Rarely: I engaged in academic activities once ...,8,18.0,0.0,4.28,4.28


In [213]:
data["school"].unique()

<StringArray>
[                                                     'university of lagos',
                                                                      'nil',
                                                                   'unilag',
            'federal university of petroleum resources effurun delta state',
                                                           'oou ogun state',
                                                     'university of ibadan',
 'alex ekwueme federal university ndufu-alike ikwo, abakaliki ebonyi state',
                                                   'lagos state university',
                                                                   'funaab',
                                                        'university of uyo',
                                                      'university of abuja',
                                                   'bayero university kano',
                                                    'universit

Cleaning the school column

In [214]:
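def clean_school(name):
    """Map a free-text school name to a canonical name using substring checks."""
    # Answers such as "unilag", "nil" or "i am ..." are treated as UNILAG
    if ("unilag" in name
        or "university of lagos" in name
        or "i am" in name
        or "nil" in name):
        return "university of lagos"
    elif "ui" in name:
        return "university of ibadan"
    elif "university of nigeria" in name:
        return "university of nigeria"
    else:
        # Leave every other school name unchanged
        return name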
        
data["school"] = data["school"].apply(clean_school).astype('category')
data["school"].unique()

['university of lagos', 'federal university of petroleum resources eff..., 'oou ogun state', 'university of ibadan', 'alex ekwueme federal university ndufu-alike i..., ..., 'funaab', 'university of uyo', 'university of abuja', 'bayero university kano', 'university of nigeria']
Length: 11
Categories (11, object): ['alex ekwueme federal university ndufu-alike i..., 'bayero university kano', 'federal university of petroleum resources eff..., 'funaab', ..., 'university of ibadan', 'university of lagos', 'university of nigeria', 'university of uyo']
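
A substring test for `"ui"` or `"nil"` matches any string that contains those letters, not only the abbreviation itself. A hedged alternative, assuming the intent is to match whole words, uses regular expressions with word boundaries. `clean_school_strict`, its pattern list, and the name "quist college" are made up for illustration:

```python
import re

# Whole-word patterns mapped to a canonical name (illustrative list)
PATTERNS = [
    (re.compile(r"\b(unilag|university of lagos|nil)\b"), "university of lagos"),
    (re.compile(r"\b(ui|university of ibadan)\b"), "university of ibadan"),
]

def clean_school_strict(name: str) -> str:
    """Return a canonical school name when a whole-word pattern matches."""
    for pattern, canonical in PATTERNS:
        if pattern.search(name):
            return canonical
    return name

print(clean_school_strict("ui"))
print(clean_school_strict("unilag"))
print(clean_school_strict("university of uyo"))
# A name that merely contains the letters "ui" is left untouched
print(clean_school_strict("quist college"))
```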

How many students are from other schools?

In [215]:
len(data[data["school"]!="university of lagos"])

16

## Quest for Duplicates

Are there any duplicates?

In [216]:
#Search for duplicates

data[data.duplicated()]

Unnamed: 0,school,level,age,gender,relationship,faculty,department,strike_effect,challenge,work,skills,prep_before,prep_after,lecture,academic_act,courses_taken,course_unit,cgpa_before,cgpa_after,cgpa_change
160,university of lagos,400 Level,22,Female,Single,Education,Arts & social science education,I just want to end all this..🥲,Having to return back to reading books and att...,Worked in a role unrelated to my studies,"Acquired skills unrelated to course of study, ...",Moderately,Poorly,No noticeable change,Never: I did not engage in any academic activi...,8,16,3.69,2.34,-1.35


In [217]:
data = data.drop_duplicates(keep='first')
data.reset_index(drop=True, inplace = True)
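
`duplicated()` flags a row only when every column matches an earlier row, and `keep='first'` retains the first occurrence. A small illustration on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [22, 22, 23],
    "level": ["400 Level", "400 Level", "400 Level"],
})

# Only the second row repeats an earlier row in every column
flags = toy.duplicated().tolist()
print(flags)

deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(len(deduped))
```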

Checking missing values for each column, once more

In [218]:
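# Note: df_valid_cgpa is only created in a later cell (In [235]); the output below comes from that dataframe.
# On a fresh top-to-bottom run, call data.isna().sum() here instead.
df_valid_cgpa.isna().sum()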

school            0
level             0
age               0
gender            0
relationship      0
faculty           0
department        0
strike_effect    60
challenge        63
work              0
skills            0
prep_before       0
prep_after        0
lecture           0
academic_act      0
courses_taken     0
course_unit       0
cgpa_before       0
cgpa_after        0
cgpa_change       0
dtype: int64

Lots of missing values in the `strike_effect` and `challenge` columns. We'll deal with that later.

## Predicting missing course unit values

Given the number of missing values in the course_unit column, we're going to create a simple model to predict the values.

### Relevant columns

Select the columns relevant to predicting `course_unit`. We expect it to depend on `school`, `level`, `faculty`, `department`, and `courses_taken`.

In [219]:
#select relevant columns
course_unit_data  = data[["school", "level", "faculty", "department", "courses_taken","course_unit"]]
course_unit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429 entries, 0 to 428
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   school         429 non-null    category
 1   level          429 non-null    category
 2   faculty        429 non-null    category
 3   department     429 non-null    category
 4   courses_taken  429 non-null    Int64   
 5   course_unit    351 non-null    Int64   
dtypes: Int64(2), category(4)
memory usage: 13.4 KB


### Separating the valids from the invalids

Now, we're going to separate the course unit data. We'll create two dataframes to house the given values and the missing values. We aim to use the known values to predict the unknown.

In [220]:
from sklearn.linear_model import LinearRegression

# Step 1: Prepare the data

valid_course_units = (course_unit_data[course_unit_data['course_unit'].notnull() & (course_unit_data['course_unit'] > 0)]
             .reset_index(drop=True))
invalid_course_units = course_unit_data[course_unit_data["course_unit"].isnull()]

### Creating our features and target

Now we split the valid course units to create our features and target.

**Note**: We're going to be using the suffix "_cu" to represent course units.

In [221]:
# Step 2: Split the data into features (X) and target (y)
target = 'course_unit'

X_cu = valid_course_units.drop(columns=[target])
y_cu = valid_course_units[target].astype(np.int64)

### Encoding

Encode categorical variables

In [222]:
# Define the columns that need to be encoded (categorical columns)
categorical_cols = ['school','level', 'faculty',
                    'department']

# Define the preprocessor: one-hot encode the categorical columns (no imputation is applied here)

encoder = OneHotEncoder(handle_unknown='ignore')


preprocessor = ColumnTransformer(
    transformers=[
        ('cat', encoder, categorical_cols),
    ],
    remainder='passthrough'  # keep the remaining column (courses_taken) unchanged
)
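
With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as a row of zeros instead of raising an error. This matters here because a row with a missing course unit may belong to a department that never appears among the valid rows. A toy illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"department": ["Law", "Physics"]})
unseen = pd.DataFrame({"department": ["Zoology"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# One output column per category seen during fit
seen_categories = encoder.categories_[0].tolist()
print(seen_categories)

# The unseen category becomes an all-zero row
encoded_unseen = encoder.transform(unseen).toarray()
print(encoded_unseen.tolist())
```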

In [223]:
# Step 4: Fit the preprocessor on the rows with known course units and transform them
X_cu_processed = preprocessor.fit_transform(X_cu)

### Training and test data

Create training and test data

In [224]:
from sklearn.model_selection import train_test_split

X_cu_train, X_cu_test, y_cu_train, y_cu_test = train_test_split(X_cu_processed, y_cu, test_size=0.2, random_state=42)

### Time to train the model!

In [225]:
# Step 5: Train a model on the training data
model = LinearRegression()
model.fit(X_cu_train, y_cu_train)
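
The notebook fits the preprocessor on all valid rows before splitting. An alternative sketch bundles the encoder and the regressor in a `Pipeline`, so that the encoder is fitted on the training split only. All data below are synthetic and the column subset is chosen for brevity:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the rows with known course units
toy = pd.DataFrame({
    "level": ["100 Level", "200 Level", "300 Level", "400 Level"] * 5,
    "courses_taken": [8, 9, 10, 7] * 5,
    "course_unit": [18, 20, 23, 16] * 5,
})
X = toy.drop(columns=["course_unit"])
y = toy["course_unit"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["level"])],
        remainder="passthrough",
    )),
    ("model", LinearRegression()),
])

# fit() learns the encoding from the training rows only
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions.shape)
```

The fitted pipeline can then be applied directly to the raw rows with missing course units, without a separate `transform` call.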

### Model Evaluation

How well does the model perform?

In [226]:
# Step 6: Predict on the test data
y_cu_test_pred = model.predict(X_cu_test)

# Step 7: Calculate Mean Squared Error (MSE) on the test data
mse_cu = mean_squared_error(y_cu_test, y_cu_test_pred)

# Step 8: Calculate R-squared (coefficient of determination) on the test data
r2_cu = r2_score(y_cu_test, y_cu_test_pred)

print("Mean Squared Error (MSE):", mse_cu)
print("R-squared (Coefficient of Determination):", r2_cu)

Mean Squared Error (MSE): 23.78469293643814
R-squared (Coefficient of Determination): -0.0863873180035557
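
A negative R-squared means the model's squared error on the test set is larger than that of a constant predictor that always outputs the test-set mean, so the imputed values should be treated with caution. The sketch below uses made-up numbers to compare a set of predictions against that mean baseline:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up course units and predictions, for illustration only
y_true = np.array([18, 20, 22, 24])
y_model = np.array([24, 18, 25, 19])

# Baseline: always predict the mean of the observed values
y_baseline = np.full(y_true.shape, y_true.mean())

r2_baseline = r2_score(y_true, y_baseline)  # the mean predictor scores 0
r2_model = r2_score(y_true, y_model)        # below 0: worse than the baseline
print(r2_baseline, r2_model)

# The same comparison expressed through MSE
print(mean_squared_error(y_true, y_model) > mean_squared_error(y_true, y_baseline))
```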


### Time to predict the missing values

In [227]:
# Step 9: Use the trained model to predict the missing values
X_cu_invalid = invalid_course_units.drop(columns=['course_unit'])
X_cu_invalid_processed = preprocessor.transform(X_cu_invalid)
predicted_cu_values = np.round(model.predict(X_cu_invalid_processed))

In [228]:
# Step 10: Assign the predicted values to the rows with missing course units.
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning.
invalid_course_units = invalid_course_units.copy()
invalid_course_units['course_unit'] = predicted_cu_values  # already rounded, so np.ceil is unnecessary
invalid_course_units.head()


Unnamed: 0,school,level,faculty,department,courses_taken,course_unit
3,university of lagos,400 Level,Social Sciences,Political science,7,12.0
12,university of lagos,500 Level,Engineering,Surveying & geo-informatics engineering,10,22.0
17,university of lagos,100 Level,Sciences,Marine science,9,18.0
18,university of lagos,400 Level,Engineering,Chemical engineering,10,22.0
23,university of lagos,300 Level,Education,Educational foundations,10,15.0


#### Merging the predicted values

Now let's merge the predicted values into the main dataframe.

Create a copy of the previous dataframe, to prevent errors.

In [229]:
df_clean = data.copy()

Add the predicted values.

In [230]:
changed_rows = invalid_course_units.index

In [231]:
df_clean.loc[changed_rows, "course_unit"] = invalid_course_units["course_unit"]

df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429 entries, 0 to 428
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   school         429 non-null    category
 1   level          429 non-null    category
 2   age            429 non-null    Int64   
 3   gender         429 non-null    category
 4   relationship   429 non-null    category
 5   faculty        429 non-null    category
 6   department     429 non-null    category
 7   strike_effect  343 non-null    string  
 8   challenge      338 non-null    string  
 9   work           429 non-null    category
 10  skills         429 non-null    string  
 11  prep_before    429 non-null    category
 12  prep_after     429 non-null    category
 13  lecture        429 non-null    category
 14  academic_act   429 non-null    category
 15  courses_taken  429 non-null    Int64   
 16  course_unit    429 non-null    Int64   
 17  cgpa_before    429 non-null    Floa
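
The assignment above relies on pandas aligning the right-hand Series to the target rows by index label, not by position. A toy illustration with made-up values:

```python
import pandas as pd

full = pd.DataFrame({"course_unit": [23.0, None, 18.0, None]})

# Predictions indexed by the original row labels, deliberately out of order
predicted = pd.Series([15.0, 12.0], index=[3, 1])

# Each value lands on the row that shares its index label
full.loc[predicted.index, "course_unit"] = predicted
print(full["course_unit"].tolist())
```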

## Addressing cases of No CGPA

Since the 2022 ASUU strike occurred in the first semester, we expect that newly admitted students won't have a CGPA from before the strike. In such cases, the participants were told to input "0".

In [232]:
# Extract a dataframe of individuals who had no CGPA before the strike (entered as 0)

no_cgpa_before = df_clean[df_clean['cgpa_before'] == 0]

no_cgpa_before.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 109 entries, 4 to 427
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   school         109 non-null    category
 1   level          109 non-null    category
 2   age            109 non-null    Int64   
 3   gender         109 non-null    category
 4   relationship   109 non-null    category
 5   faculty        109 non-null    category
 6   department     109 non-null    category
 7   strike_effect  83 non-null     string  
 8   challenge      81 non-null     string  
 9   work           109 non-null    category
 10  skills         109 non-null    string  
 11  prep_before    109 non-null    category
 12  prep_after     109 non-null    category
 13  lecture        109 non-null    category
 14  academic_act   109 non-null    category
 15  courses_taken  109 non-null    Int64   
 16  course_unit    109 non-null    Int64   
 17  cgpa_before    109 non-null    Floa

### No CGPA After? 
At the time of this survey, some students had yet to see their results, so they entered 0 in the CGPA After column. While their comments may be useful, we can't use the value they entered in the analysis. Therefore, we have to address this.

What proportion of students in the dataframe are affected?

In [233]:
percent_affected = len(df_clean[df_clean['cgpa_after'] == 0]) / len(df_clean) * 100

f"About {round(percent_affected)}% of students were affected"

'About 9% of students were affected'

How many students have complete info?

In [234]:
len(df_clean[(df_clean['cgpa_before'] != 0.00) & (df_clean['cgpa_after'] != 0)])

313

### Valid CGPA

Now we create a dataframe of students with valid CGPAs. That is, neither CGPA before nor CGPA after is 0.

In [235]:
df_valid_cgpa = df_clean[(df_clean['cgpa_before'] != 0.00) & (df_clean['cgpa_after'] != 0)]

In [236]:
#Checking data information
df_valid_cgpa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 313 entries, 0 to 428
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   school         313 non-null    category
 1   level          313 non-null    category
 2   age            313 non-null    Int64   
 3   gender         313 non-null    category
 4   relationship   313 non-null    category
 5   faculty        313 non-null    category
 6   department     313 non-null    category
 7   strike_effect  253 non-null    string  
 8   challenge      250 non-null    string  
 9   work           313 non-null    category
 10  skills         313 non-null    string  
 11  prep_before    313 non-null    category
 12  prep_after     313 non-null    category
 13  lecture        313 non-null    category
 14  academic_act   313 non-null    category
 15  courses_taken  313 non-null    Int64   
 16  course_unit    313 non-null    Int64   
 17  cgpa_before    313 non-null    Floa

## Saving our dataframes!

Now that we're done cleaning, it's time to save all our dataframes for use in the analysis files!

In [237]:
# cleaned data before course unit prediction and removal of invalid cgpa
data.to_csv("../data/data_without_predicted_course_units.csv", header=True, index=False)

# cleaned data after course unit prediction
df_clean.to_csv("../data/data_with_predicted_course_units.csv", header=True, index=False)

# contains students with no CGPA before the strike (cgpa_before == 0), with predicted course units
no_cgpa_before.to_csv("../data/incomplete_cgpa_only.csv", header=True, index=False)

# contains only valid cgpa and predicted course units
df_valid_cgpa.to_csv("../data/valid_cgpa_only.csv", header=True, index=False)
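
CSV files do not store pandas dtypes, so `category` and nullable `Int64` columns are inferred afresh when the files are reloaded. A sketch of restoring them on read, using an in-memory buffer in place of a file. The dtype mapping is an example and would need to list the real columns:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "level": pd.Series(["100 Level", "400 Level"], dtype="category"),
    "course_unit": pd.Series([18, 23], dtype="Int64"),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: dtypes are inferred again, so the category dtype is lost
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes.astype(str).tolist())

# Explicit dtypes restore the cleaned schema
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"level": "category", "course_unit": "Int64"})
print(restored.dtypes.astype(str).tolist())
```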