

COVID-19 Data Analysis Project

The dataset is from Kaggle: https://www.kaggle.com/datasets/kunal28chaturvedi/covid19-and-its-impact-on-students

COVID-19 is a global pandemic caused by the coronavirus SARS-CoV-2. It presents unique challenges for data analysis, with key focus areas including infection and mortality rates, healthcare impacts, and the effectiveness of control measures. Data analysis in this context is crucial for understanding the pandemic's spread and informing public health decisions.


Educational Disruption: Many schools and universities shifted to remote learning, affecting the quality and accessibility of education. This sudden change disrupted traditional teaching methods and learning experiences.

Social and Emotional Impact: The lack of in-person interaction and extracurricular activities led to increased feelings of isolation and stress among students.

Imports

** Import pandas, numpy, matplotlib, and seaborn. **

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Read in the COVID-19 dataset CSV file

In [9]:
df = pd.read_csv('COVID-19 Survey Student Responses.csv')

In [10]:
df.head()

Unnamed: 0,ID,Region of residence,Age of Subject,Time spent on Online Class,Rating of Online Class experience,Medium for online class,Time spent on self study,Time spent on fitness,Time spent on sleep,Time spent on social media,Prefered social media platform,Time spent on TV,Number of meals per day,Change in your weight,Health issue during lockdown,Stress busters,Time utilized,"Do you find yourself more connected with your family, close friends , relatives ?",What you miss the most
0,R1,Delhi-NCR,21,2.0,Good,Laptop/Desktop,4.0,0.0,7.0,3.0,Linkedin,1,4,Increased,NO,Cooking,YES,YES,School/college
1,R2,Delhi-NCR,21,0.0,Excellent,Smartphone,0.0,2.0,10.0,3.0,Youtube,0,3,Decreased,NO,Scrolling through social media,YES,NO,Roaming around freely
2,R3,Delhi-NCR,20,7.0,Very poor,Laptop/Desktop,3.0,0.0,6.0,2.0,Linkedin,0,3,Remain Constant,NO,Listening to music,NO,YES,Travelling
3,R4,Delhi-NCR,20,3.0,Very poor,Smartphone,2.0,1.0,6.0,5.0,Instagram,0,3,Decreased,NO,Watching web series,NO,NO,"Friends , relatives"
4,R5,Delhi-NCR,21,3.0,Good,Laptop/Desktop,3.0,1.0,8.0,3.0,Instagram,1,4,Remain Constant,NO,Social Media,NO,NO,Travelling
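Before cleaning, it can help to check the column types and count the missing values. The sketch below is only an illustration: `sample` is a hypothetical stand-in built from a few columns of the five rows shown above, not the full CSV.

```python
import pandas as pd

# Stand-in for the real CSV: a few columns from the first five rows shown above
sample = pd.DataFrame({
    'ID': ['R1', 'R2', 'R3', 'R4', 'R5'],
    'Age of Subject': [21, 21, 20, 20, 21],
    'Time spent on Online Class': [2.0, 0.0, 7.0, 3.0, 3.0],
    'Rating of Online Class experience': ['Good', 'Excellent', 'Very poor', 'Very poor', 'Good'],
})

# A text dtype on a column that should be numeric usually signals dirty values
dtypes = sample.dtypes

# Number of missing values in each column
missing = sample.isna().sum()

print(dtypes)
print(missing)
```

On the full DataFrame the same two calls (`df.dtypes` and `df.isna().sum()`) would show which columns need the cleaning steps that follow.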


Data Cleaning




Correcting Non-Numeric Values:
    • Convert 'Time spent on TV' to numeric, setting non-numeric values to NaN.
    • Then, fill these NaN values with the median of the column.
Standardizing Text Data:
    • Standardize the 'Prefered social media platform' entries by converting them all to lowercase. This handles inconsistencies like 'Facebook' vs 'facebook'.

In [14]:
# Correcting Non-Numeric Values in 'Time spent on TV'
# Converting 'Time spent on TV' to numeric, setting non-numeric values to NaN
df['Time spent on TV'] = pd.to_numeric(df['Time spent on TV'], errors='coerce')

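# Filling NaN values with the median
# (assign the result back; inplace=True on a column selection triggers a
# chained-assignment warning in recent pandas versions and may not update df)
median_tv_time = df['Time spent on TV'].median()
df['Time spent on TV'] = df['Time spent on TV'].fillna(median_tv_time)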

# Standardizing Text Data in 'Prefered social media platform'
# Converting all entries to lowercase
df['Prefered social media platform'] = df['Prefered social media platform'].str.lower()

# Displaying the cleaned data for these columns
df[['Time spent on TV', 'Prefered social media platform']].head()



Unnamed: 0,Time spent on TV,Prefered social media platform
0,1.0,linkedin
1,0.0,youtube
2,0.0,linkedin
3,0.0,instagram
4,1.0,instagram
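To see what the coerce-then-fill step does, here is a minimal sketch on a made-up Series. The entries 'No tv' and 'n' are hypothetical examples of non-numeric answers, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical raw entries: numeric strings mixed with free-text answers
tv = pd.Series(['1', '0', 'No tv', '2.5', 'n'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
tv_num = pd.to_numeric(tv, errors='coerce')

# The median ignores NaN, so it is computed from 0, 1 and 2.5 only
median_tv = tv_num.median()

# Replace the NaN entries with that median
tv_clean = tv_num.fillna(median_tv)
print(tv_clean.tolist())  # [1.0, 0.0, 1.0, 2.5, 1.0]
```

The median is less sensitive to a few extreme values than the mean, which is one reason to prefer it for skewed time-use data.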


Remove Duplicate Rows

In [19]:
df = df.drop_duplicates()
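Called without arguments, `drop_duplicates()` compares every column, so two identical responses stored under different `ID` values are both kept. Whether such rows should count as duplicates is a judgment call. The sketch below uses made-up rows to show how to compare only the answer columns.

```python
import pandas as pd

# Made-up rows: R2 repeats R1's answers under a different ID
responses = pd.DataFrame({
    'ID': ['R1', 'R2', 'R3'],
    'Age of Subject': [21, 21, 20],
    'Time spent on sleep': [7.0, 7.0, 6.0],
})

# Comparing all columns finds no duplicates because the IDs differ
n_full = responses.duplicated().sum()

# Comparing only the answer columns flags the repeated response
answer_cols = [c for c in responses.columns if c != 'ID']
n_answers = responses.duplicated(subset=answer_cols).sum()

# Keep the first occurrence of each distinct set of answers
deduped = responses.drop_duplicates(subset=answer_cols, keep='first')
print(n_full, n_answers, len(deduped))  # 0 1 2
```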

Convert all text in the 'Prefered social media platform' column to lowercase

In [39]:
df['Prefered social media platform'] = df['Prefered social media platform'].str.lower()

Convert 'YES'/'NO' answers to 1/0 for 'Health issue during lockdown' (any other value becomes NaN)

In [40]:
df['Health issue during lockdown'] = df['Health issue during lockdown'].map({'YES': 1, 'NO': 0})
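`Series.map` with a dictionary returns NaN for any value that is not a key, so answers with different casing or stray whitespace would be lost. The sketch below uses hypothetical answers to show normalizing the text before mapping.

```python
import pandas as pd

# Hypothetical answers with inconsistent case and stray whitespace
health = pd.Series(['YES', 'NO', ' no', 'Yes ', 'maybe'])

# A direct map leaves anything that is not exactly 'YES' or 'NO' as NaN
direct = health.map({'YES': 1, 'NO': 0})

# Normalizing first recovers the variants; truly unknown answers stay NaN
normalized = health.str.strip().str.upper().map({'YES': 1, 'NO': 0})

print(direct.isna().sum(), normalized.isna().sum())  # 3 1
```

Counting the NaN values after the map is a quick way to confirm that no unexpected answers were dropped.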

Replace missing values in 'Time spent on Online Class' with the mean

In [41]:
df['Time spent on Online Class'] = df['Time spent on Online Class'].fillna(df['Time spent on Online Class'].mean())

Correct non-numeric values in 'Time spent on TV'

In [42]:
df['Time spent on TV'] = pd.to_numeric(df['Time spent on TV'], errors='coerce')
df['Time spent on TV'] = df['Time spent on TV'].fillna(df['Time spent on TV'].median())

Trim whitespace from all string columns

In [43]:
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

Rename columns: replace spaces with underscores and convert to lowercase

In [45]:
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]
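Replacing only spaces leaves commas and question marks in the longer column names. As an alternative, the sketch below collapses every run of non-alphanumeric characters into a single underscore. The helper name `snake_case` is hypothetical. For the simple column names the result matches the cell above.

```python
import re

def snake_case(name: str) -> str:
    """Lowercase a name and collapse each run of non-alphanumerics into one underscore."""
    return re.sub(r'[^0-9a-z]+', '_', name.strip().lower()).strip('_')

cols = [
    'Time spent on TV',
    'Do you find yourself more connected with your family, close friends , relatives ?',
]
clean_cols = [snake_case(c) for c in cols]
print(clean_cols)
```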

Convert ratings to ordinal scale

In [46]:
ratings_scale = {'Very poor': 1, 'Poor': 2, 'Average': 3, 'Good': 4, 'Excellent': 5}
df['rating_of_online_class_experience'] = df['rating_of_online_class_experience'].map(ratings_scale)
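An ordered categorical is another way to encode the ratings: it keeps the text labels and their ranking together. The sketch below is an alternative to the dictionary mapping, not the approach used in this notebook. The ratings are those of the first five rows shown earlier.

```python
import pandas as pd

# Labels from lowest to highest rating
order = ['Very poor', 'Poor', 'Average', 'Good', 'Excellent']

# Ratings of the first five rows shown earlier
ratings = pd.Series(['Good', 'Excellent', 'Very poor', 'Very poor', 'Good'])

# Ordered categorical: labels are kept and comparisons respect the ranking
rating_cat = pd.Categorical(ratings, categories=order, ordered=True)

# Codes are 0-based, so add 1 to match the 1-5 scale
# (a label outside `order` gets code -1, which would become 0 here)
rating_num = pd.Series(rating_cat.codes) + 1
print(rating_num.tolist())  # [4, 5, 1, 1, 4]
```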

Fill missing values in 'time_spent_on_self_study' with the median

In [47]:
df['time_spent_on_self_study'] = df['time_spent_on_self_study'].fillna(df['time_spent_on_self_study'].median())

Display the cleaned DataFrame

In [48]:
df.head()

Unnamed: 0,id,region_of_residence,age_of_subject,time_spent_on_online_class,rating_of_online_class_experience,medium_for_online_class,time_spent_on_self_study,time_spent_on_fitness,time_spent_on_sleep,time_spent_on_social_media,prefered_social_media_platform,time_spent_on_tv,number_of_meals_per_day,change_in_your_weight,health_issue_during_lockdown,stress_busters,time_utilized,"do_you_find_yourself_more_connected_with_your_family,_close_friends_,_relatives__?",what_you_miss_the_most
0,R1,Delhi-NCR,21,2.0,4.0,Laptop/Desktop,4.0,0.0,7.0,3.0,linkedin,1.0,4,Increased,0,Cooking,YES,YES,School/college
1,R2,Delhi-NCR,21,0.0,5.0,Smartphone,0.0,2.0,10.0,3.0,youtube,0.0,3,Decreased,0,Scrolling through social media,YES,NO,Roaming around freely
2,R3,Delhi-NCR,20,7.0,1.0,Laptop/Desktop,3.0,0.0,6.0,2.0,linkedin,0.0,3,Remain Constant,0,Listening to music,NO,YES,Travelling
3,R4,Delhi-NCR,20,3.0,1.0,Smartphone,2.0,1.0,6.0,5.0,instagram,0.0,3,Decreased,0,Watching web series,NO,NO,"Friends , relatives"
4,R5,Delhi-NCR,21,3.0,4.0,Laptop/Desktop,3.0,1.0,8.0,3.0,instagram,1.0,4,Remain Constant,0,Social Media,NO,NO,Travelling
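A few sanity checks can confirm that the cleaning behaved as intended. In the sketch below, `cleaned` is a hypothetical stand-in built from the rows shown above. The 0 to 24 range check assumes the time columns are measured in hours per day, which the notebook does not state.

```python
import pandas as pd

# Stand-in for the cleaned frame, using values from the rows shown above
cleaned = pd.DataFrame({
    'time_spent_on_online_class': [2.0, 0.0, 7.0, 3.0, 3.0],
    'rating_of_online_class_experience': [4.0, 5.0, 1.0, 1.0, 4.0],
    'time_spent_on_tv': [1.0, 0.0, 0.0, 0.0, 1.0],
    'health_issue_during_lockdown': [0, 0, 0, 0, 0],
})

# Remaining missing values per column (ratings outside the mapping would show up here)
remaining_na = cleaned.isna().sum()

# Ratings should fall on the 1-5 scale
ratings_ok = cleaned['rating_of_online_class_experience'].dropna().between(1, 5).all()

# Assumption: time columns are hours per day, so values should lie in [0, 24]
hour_cols = ['time_spent_on_online_class', 'time_spent_on_tv']
hours_ok = all(cleaned[c].between(0, 24).all() for c in hour_cols)

print(remaining_na.sum(), ratings_ok, hours_ok)
```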


Data Manipulation

Data Visualization

Hypothesis Testing and Statistical Analysis

Advanced Analysis