# Christopher Reid - Homework 4

In [1]:
# Module Imports
import csv
import pandas as pd

# Part 1: Setup

## 1.1: Dataset Selection

source dataset: https://www.kaggle.com/datasets/atharvbharaskar/reading-habits-of-students

# Part 2: Data Collection

## 2.1: Dataset Description

My chosen dataset, published by Kaggle user Atharv Bharaskar, comes from a survey of college students in India about their use of campus libraries, reading habits, learning preferences, and other reading-related factors. The survey was shared with students at the campus library, and their responses were recorded using Google Forms. Between 10 and 15 thousand students, ranging from 11th grade to master’s degree programs across a wide variety of courses, participated. The dataset aims to capture insights into students’ reading and study habits, library usage, and preferred learning methods.

The columns are student gender, departmental affiliation, census location category, preferred type of books for studying, frequency of library visits, purposes for library visits, average time spent in college, general purpose of library visits, preferred location for studying, preferred time of library visits, preference of language for learning, preferred type of reading material, whether they enjoy reading or not, and preferred mode of learning. Additional columns include students’ reported impacts of Covid on learning, how often they study, reported study habits prior to college, study habits post-college, awareness levels about the National Digital Library, usage of the NDL, reported pandemic effects on reading habits, book purchasing behavior from physical stores, average expenditure on books, father's occupation, and education level of their parents. 

The page lists no creation date. The dataset was last updated in May 2023, and the page lists no expected update frequency.

## 2.2: Dataset Download

In [2]:
filename = 'Reading Habbit Of Students.csv'
reading_df = pd.read_csv(filename)
reading_df.sample(2)

Unnamed: 0.1,Unnamed: 0,gender,faculty,Enter Your Location,kind of books preffered for study,How Frequently do you visit library,For what Purposes do yo visit library,Average Time spent in collage,What is general Purposes,Which one is your Prefered location,...,Dose Covid 19 Pandemic Affected Your Reading Habits,Do you purchase Books from store,Average Expenditure on books,Occupation Of Father,Parents Education,Select your Faculty,Enter your Location,Preferred Language for Learning,Do you Using National dig,Occupation of Father
167,167,Female,Science,Rural,Reference Books,Once in a week,For Reading Novels and,g Less than an hour,To get the knowledge and,Central Library,...,Yes - Negatively Affected,No,Less Than Rs. 500,Farmer,Educated,Science,Rural,Marathi,No,Farmer
42,42,Male,Science,Rural,Text Books,2-3 times in a week,To borrow library material,s 2-4 hours,To pass the examination,College lab,...,Yes - Positively Affected,Yes,Less Than Rs. 500,Farmer,Educated,Science,Rural,English,No,Farmer
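
The `Unnamed: 0` columns in the sample above usually appear when a CSV was written with its row index included. As a minimal sketch of one way to avoid them at load time, the example below reads a small made-up CSV string (not the real file) with and without `index_col=0`. The variable names and the toy data are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file; the leading
# unnamed column mimics an index that was saved along with the data.
csv_text = ",gender,faculty\n0,Female,Science\n1,Male,Science\n"

# Default read: the saved index comes back as a data column.
raw_df = pd.read_csv(io.StringIO(csv_text))
print(raw_df.columns.to_list())

# index_col=0 uses that first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_df.columns.to_list())
```

This notebook drops the column after loading instead, which works equally well; `index_col=0` is simply an alternative when the file layout is known in advance.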


# Part 3: Data Preprocessing

## 3.1: Data Cleaning

### Checking for NaN values

In [3]:
reading_df.isnull().sum()

Unnamed: 0                                             0
gender                                                 0
faculty                                                0
Enter Your Location                                    0
kind of books preffered for study                      0
How Frequently do you visit library                    0
For what Purposes do yo visit library                  0
Average Time spent in collage                          0
What is general Purposes                               0
Which one is your Prefered location                    0
What is your preferred time?                           0
Preferred language for Learning                        0
Preferred type for reading                             0
Do you enjoy the Reading                               0
Which mode of learning /                               0
Dose Covid Pandemic Ch                                 0
How do you study before collage                        0
How do you study after Collage 
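
A zero count from `isnull()` only means that no cell is empty in the pandas sense; placeholder strings still pass the check. The sketch below, which uses a made-up toy column rather than the survey data, shows one possible way to flag blank or single-character entries for manual review. The one-character cutoff is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame({
    "study_habit": ["Reading books in library", "a", " ", "Watching Lecture Videos"]
})

# isnull() reports no missing values because every cell holds a string.
print(toy["study_habit"].isnull().sum())

# Flag entries that are blank or at most one character long after stripping.
stripped = toy["study_habit"].str.strip()
suspicious = toy[stripped.str.len() <= 1]
print(suspicious)
```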

## 3.2: Outlier Detection

In [4]:
# Print the frequency of each distinct value, column by column
for curr_label in reading_df.columns:
    print('Column Name: ' + str(curr_label))
    print(reading_df[curr_label].value_counts().to_list())
    print()

Column Name: Unnamed: 0
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Column Name: gender
[159, 69]

Column Name: faculty
[209, 7, 6, 6]

Column Name: Enter Your Location
[182, 46]

Column Name: kind of books preffered for study
[77, 76, 39, 21, 15]

Column Name: How Frequently do you visit library
[71, 69, 36, 33, 19]

Column Name: For what Purposes do yo v
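
The bare count lists above show how unevenly the responses are distributed but not which category each count belongs to. As a rough sketch on a made-up toy column, the example below keeps the category labels and flags categories that fall under a share threshold. The 10% cutoff and the toy values are assumptions, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for one survey question.
answers = pd.Series(["Science"] * 18 + ["Arts", "Commerce"], name="faculty")

# Keeping the index shows which category each count belongs to,
# and normalize=True turns counts into shares of all responses.
shares = answers.value_counts(normalize=True)

# The 10% cutoff is an arbitrary, illustrative threshold.
rare = shares[shares < 0.10]
print(rare)
```

For categorical survey data, a rare category is not necessarily an error, so a list like this is best treated as a starting point for inspection.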

## 3.3: Data Quality and Consistency

The initial dataframe had a few problems that needed fixing before going further. The CSV file contained a column that only stored the row index, so I first removed that column along with the duplicate columns at the end of the table. Some of the labels were not intuitive, so I renamed them to shorter, clearer names and standardized their format to make the strings easier to use in later processing. Lastly, several columns contained responses with the same meaning recorded as slightly different strings and, in a few cases, a response that belonged to a different question. Each column with these errors was cleaned and standardized.
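
The first of these steps can also be done without reading the label list by eye. The sketch below, on a made-up toy frame, shows one possible way to find columns whose labels differ only in case or whitespace and drop them together with the saved index column. The toy data and variable names are hypothetical, and in this notebook the columns to drop are picked manually instead.

```python
import pandas as pd

# Hypothetical toy frame: the last column repeats 'Enter Your Location'
# under a label that differs only in capitalization.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Enter Your Location": ["Rural", "Urban"],
    "gender": ["Female", "Male"],
    "Enter your Location": ["Rural", "Urban"],
})

# Normalize labels so that case and stray whitespace do not hide duplicates.
normalized = toy.columns.str.strip().str.lower()

# duplicated() marks every repeat after the first occurrence.
duplicate_labels = toy.columns[normalized.duplicated()].to_list()
print(duplicate_labels)

# Drop the saved index column together with the repeated columns.
cleaned = toy.drop(columns=["Unnamed: 0"] + duplicate_labels)
print(cleaned.columns.to_list())
```

Matching labels do not guarantee matching contents, so it is worth comparing the values of each pair before dropping one of them. Labels that were truncated differently would also be missed by this check.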

### Removing duplicate columns from csv formatting

In [5]:
reading_df.columns

Index(['Unnamed: 0', 'gender', 'faculty', 'Enter Your Location',
       'kind of books preffered for study',
       'How Frequently do you visit library',
       'For what Purposes do yo visit library',
       'Average Time spent in collage', 'What is general Purposes',
       'Which one is your Prefered location ', 'What is your preferred time?',
       'Preferred language for Learning', 'Preferred type for reading',
       'Do you enjoy the Reading', 'Which mode of learning /',
       'Dose Covid Pandemic Ch', 'How do you study before collage',
       'How do you study after Collage', 'Do you aware about Nati',
       'Do you Using National di',
       'Dose Covid 19 Pandemic Affected Your Reading Habits',
       'Do you purchase Books from store', 'Average Expenditure on books',
       'Occupation Of Father', 'Parents Education', 'Select your Faculty',
       'Enter your Location', 'Preferred Language for Learning',
       'Do you Using National dig', 'Occupation of Father'],
      

In [6]:
removed_labels = ['Unnamed: 0','Select your Faculty',
       'Enter your Location', 'Preferred Language for Learning',
       'Do you Using National dig', 'Occupation of Father']
reading_df.drop(labels=removed_labels, inplace=True, axis=1)
reading_df.columns

Index(['gender', 'faculty', 'Enter Your Location',
       'kind of books preffered for study',
       'How Frequently do you visit library',
       'For what Purposes do yo visit library',
       'Average Time spent in collage', 'What is general Purposes',
       'Which one is your Prefered location ', 'What is your preferred time?',
       'Preferred language for Learning', 'Preferred type for reading',
       'Do you enjoy the Reading', 'Which mode of learning /',
       'Dose Covid Pandemic Ch', 'How do you study before collage',
       'How do you study after Collage', 'Do you aware about Nati',
       'Do you Using National di',
       'Dose Covid 19 Pandemic Affected Your Reading Habits',
       'Do you purchase Books from store', 'Average Expenditure on books',
       'Occupation Of Father', 'Parents Education'],
      dtype='object')

### Standardizing column labels

In [7]:
renaming_labels = {
    'gender': 'Gender',
    'faculty':'Department',
    'Enter Your Location': 'Urbanization level',
    'kind of books preffered for study': 'Preffered study material',
    'How Frequently do you visit library': 'Library visit frequency',
    'For what Purposes do yo visit library': 'Library visit purpose',
    'Average Time spent in collage': 'Libaray visit length',
    'What is general Purposes': 'Purpose for attending college',
    'Which one is your Prefered location ':'Preffered study location',
    'What is your preferred time?':'Preffered study time',
    'Preferred type for reading':'Preffered reading material',
    'Do you enjoy the Reading':'Reading satisfaction',
    'Which mode of learning /':'Preffered learning modality',
    'Dose Covid Pandemic Ch':'Covid pandemic impact - learning',
    'How do you study before collage':'Pre-College study habits',
    'How do you study after Collage':'Post-College study habits',
    'Do you aware about Nati':'National Digital Library awareness',
    'Do you Using National di':'National Digital Library usage',
    'Dose Covid 19 Pandemic Affected Your Reading Habits':'Covid pandemic impact - reading',
    'Do you purchase Books from store':'Book store customer',
    'Average Expenditure on books':'Avg book cost - rupees',
    'Occupation Of Father':'Father\'s Occupation',
    'Parents Education':'Parent\'s Education Level'
}

reading_df.rename(columns = renaming_labels, inplace=True)
reading_df.columns

Index(['Gender', 'Department', 'Urbanization level',
       'Preffered study material', 'Library visit frequency',
       'Library visit purpose', 'Libaray visit length',
       'Purpose for attending college', 'Preffered study location',
       'Preffered study time', 'Preferred language for Learning',
       'Preffered reading material', 'Reading satisfaction',
       'Preffered learning modality', 'Covid pandemic impact - learning',
       'Pre-College study habits', 'Post-College study habits',
       'National Digital Library awareness', 'National Digital Library usage',
       'Covid pandemic impact - reading', 'Book store customer',
       'Avg book cost - rupees', 'Father's Occupation',
       'Parent's Education Level'],
      dtype='object')

### Combining duplicate strings/non-uniform entries

***
**Library visit purpose**

In [8]:
reading_df['Library visit purpose'].unique()

array(['For Reading Novels and', 'To Complete Assignment',
       'To read Daily News pape', 'To borrow library material',
       'To read Daily News pape r', 'To photo copy reading m',
       'Reference books', 'To use the internet',
       'To photo copy reading ma', 'Other', 'To Read Daily News Pap'],
      dtype=object)

In [9]:
def format_libvisit_purpose(inp_str):
    remapping = {
        'For Reading Novels and': 'Reading novels',
        'To read Daily News pape': 'Reading daily newspaper',
        'To read Daily News pape r': 'Reading daily newspaper',
        'To Read Daily News Pap': 'Reading daily newspaper',
        'To photo copy reading m': 'Photo copying material',
        'To photo copy reading ma': 'Photo copying material',
        'To Complete Assignment': 'Completing assignments',
        'To use the internet': 'Internet use',
        'To borrow library material': 'Borrow library material'
    }

    return remapping.get(inp_str, inp_str)

In [10]:
reading_df['Library visit purpose'] = reading_df['Library visit purpose'].apply(lambda inp_str: format_libvisit_purpose(inp_str))
reading_df['Library visit purpose'].unique()

array(['Reading novels', 'Completing assignments',
       'Reading daily newspaper', 'Borrow library material',
       'Photo copying material', 'Reference books', 'Internet use',
       'Other'], dtype=object)

***
**Libaray visit length**

In [11]:
reading_df['Libaray visit length'].unique()

array(['s 2-4 hours', 's Less than an hour', 'Less than an hour',
       '2-4 hours', 's 6 and Above', 'g Less than an hour', 'g 2-4 hours',
       'a 2-4 hours', 'a Less than an hour', 'g 6 and Above',
       'e 5-6 hours', 'e Less than an hour', 'g 5-6 hours', '5-6 hours',
       '6 and Above', 'e 2-4 hours', 'To borrow library materials'],
      dtype=object)

In [12]:
def format_libvisit_length(inp_str):
    remapping = {
        's 2-4 hours': '2-4 hrs',
        's Less than an hour': '<1 hr',
        'Less than an hour': '<1 hr',
        '2-4 hours': '2-4 hrs',
        's 6 and Above': '6+ hrs',
        'g Less than an hour': '<1 hr',
        'g 2-4 hours': '2-4 hrs',
        'a 2-4 hours': '2-4 hrs',
        'a Less than an hour': '<1 hr',
        'g 6 and Above': '6+ hrs',
        'e 5-6 hours': '5-6 hrs',
        'e Less than an hour': '<1 hr',
        'g 5-6 hours': '5-6 hrs',
        '5-6 hours': '5-6 hrs',
        '6 and Above': '6+ hrs',
        'e 2-4 hours': '2-4 hrs',
        'To borrow library materials': '0'
    }
    return remapping.get(inp_str, inp_str)

In [13]:
reading_df['Libaray visit length'] = reading_df['Libaray visit length'].apply(lambda inp_str: format_libvisit_length(inp_str))
reading_df['Libaray visit length'].unique()

array(['2-4 hrs', '<1 hr', '6+ hrs', '5-6 hrs', '0'], dtype=object)
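
Many of the raw values in this column differ only by a stray single-letter prefix such as `s `, `g `, `a `, or `e `. As an alternative sketch to listing every variant in a dictionary, a regular expression can strip that prefix before any remapping. The helper name `strip_stray_prefix` and the sample list are hypothetical, and the pattern assumes that no valid response starts with a single lowercase letter followed by a space.

```python
import re

# Matches one lowercase letter followed by a space at the very start
# of the string, e.g. the 's ' in 's 2-4 hours'.
STRAY_PREFIX = re.compile(r"^[a-z] ")

def strip_stray_prefix(value):
    """Remove a single-letter prefix such as 's ' or 'g ' if present."""
    return STRAY_PREFIX.sub("", value)

samples = ["s 2-4 hours", "g Less than an hour", "2-4 hours", "6 and Above"]
cleaned = [strip_stray_prefix(s) for s in samples]
print(cleaned)
```

The explicit dictionary used in this notebook is more verbose but safer, since it only changes values that were reviewed by hand.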

***
**Purpose for attending college**

In [14]:
reading_df['Purpose for attending college'].unique()

array(['To while away time', 'To pass the examination',
       'To get the knowledge and', 'To be well informed',
       'To increase General Kno', 'For career development',
       'To prepare class Assignm', 'For Career Development',
       'To Pass the examination'], dtype=object)

In [15]:
def format_college_purpose(inp_str):
    remapping = {
        'To while away time': 'Idling',
        'To pass the examinatio': 'Passing exam',
        'To pass the examination': 'Passing exam',
        'To increase General Kno': 'Increasing general knowledge',
        'To get the knowledge and': 'Increasing general knowledge',
        'For career development': 'Career development',
        'For Career Development': 'Career development',
        'To prepare class Assignm': 'Class assignment preparation',
        'To Pass the examination': 'Passing exam',
        'To be well informed': 'Becoming well informed'
        
    }
    return remapping.get(inp_str, inp_str)

In [16]:
reading_df['Purpose for attending college'] = reading_df['Purpose for attending college'].apply(lambda inp_str: format_college_purpose(inp_str))
reading_df['Purpose for attending college'].unique()

array(['Idling', 'Passing exam', 'Increasing general knowledge',
       'Becoming well informed', 'Career development',
       'Class assignment preparation'], dtype=object)

***
**Preffered study location**

In [17]:
reading_df['Preffered study location'].unique()

array(['Home', 'Class Room', 'Central Library', 'Park or Garden',
       'w Park or Garden', 'w Home', 'Campus Ground', 'w Campus Ground',
       'College lab', 'w Other places', 'w College lab', 'w Class Room',
       'Other places', 'w Central Library', 'College Lab',
       'To prepare class Assignments', 'Other Places', 'w College Lab'],
      dtype=object)

In [18]:
def format_study_loc(inp_str):
    remapping = {
        'w Park or Garden': 'Park or Garden',
        'w Home': 'Home',
        'w Campus Ground': 'Campus Ground',
        'w Other places': 'Other places',
        'Other Places': 'Other places',
        'w College lab': 'College lab',
        'Class Room': 'Classroom',
        'w Class Room': 'Classroom',
        'w Central Library': 'Central Library',
        'To prepare class Assignments': 'Central Library',
        'w College Lab': 'College lab',
        'College Lab': 'College lab'
    }
    return remapping.get(inp_str, inp_str)

In [19]:
reading_df['Preffered study location'] = reading_df['Preffered study location'].apply(lambda inp_str: format_study_loc(inp_str))
reading_df['Preffered study location'].unique()

array(['Home', 'Classroom', 'Central Library', 'Park or Garden',
       'Campus Ground', 'College lab', 'Other places'], dtype=object)

***
**Preffered reading material**

In [20]:
reading_df['Preffered reading material'].unique()

array(['Stories and novels', 'Magazines', 'Others', 'Text books',
       'Reference books', 'New paper', 'Competitive books',
       'Journal articles', 'New Paper'], dtype=object)

In [21]:
def format_reading_material(inp_str):
    remapping = {
        'New paper': 'Newspaper',
        'New Paper': 'Newspaper'
    }
    return remapping.get(inp_str, inp_str)

In [22]:
reading_df['Preffered reading material'] = reading_df['Preffered reading material'].apply(lambda inp_str: format_reading_material(inp_str))
reading_df['Preffered reading material'].unique()

array(['Stories and novels', 'Magazines', 'Others', 'Text books',
       'Reference books', 'Newspaper', 'Competitive books',
       'Journal articles'], dtype=object)

***
**Preffered learning modality**

In [23]:
reading_df['Preffered learning modality'].unique()

array(['Hard copy of Reading ma', 'Academic Videos',
       'soft copy of Reading mat', 'soft copy of Reading mat e',
       'Hard copy of Reading Ma'], dtype=object)

In [24]:
def format_learning_modality(inp_str):
    remapping = {
        'Hard copy of Reading ma': 'Hard copies',
        'soft copy of Reading mat': 'Soft copies',
        'soft copy of Reading mat e': 'Soft copies',
        'Hard copy of Reading Ma': 'Hard copies'
    }
    return remapping.get(inp_str, inp_str)

In [25]:
reading_df['Preffered learning modality'] = reading_df['Preffered learning modality'].apply(lambda inp_str: format_learning_modality(inp_str))
reading_df['Preffered learning modality'].unique()

array(['Hard copies', 'Academic Videos', 'Soft copies'], dtype=object)

***
**Pre-College study habits**

In [26]:
reading_df['Pre-College study habits'].unique()

array(['Reading soft copy of boo k', 'Watching Lecture Videos',
       'Reading books in library a', 'a', 'Reading books in library',
       'Reading soft copy of boo'], dtype=object)

In [27]:
def format_precollege_habit(inp_str):
    remapping = {
        'Reading soft copy of boo k': 'Soft copies',
        'a': 'Reading books in library',
        'Reading books in library a': 'Reading books in library',
        'Reading soft copy of boo': 'Soft copies'
    }
    return remapping.get(inp_str, inp_str)

In [28]:
reading_df['Pre-College study habits'] = reading_df['Pre-College study habits'].apply(lambda inp_str: format_precollege_habit(inp_str))
reading_df['Pre-College study habits'].unique()

array(['Soft copies', 'Watching Lecture Videos',
       'Reading books in library'], dtype=object)

***
**Post-College study habits**

In [29]:
reading_df['Post-College study habits'].unique()

array(['Reading soft copy of boo k', 'Watching Lecture Videos',
       'Reading books in library a', 'a', 'Reading Books in Library'],
      dtype=object)

In [30]:
def format_postcollege_habit(inp_str):
    remapping = {
        'Reading soft copy of boo k': 'Soft copies',
        'a': 'Reading books in library',
        'Reading books in library a': 'Reading books in library',
        'Reading Books in Library': 'Reading books in library'
    }
    return remapping.get(inp_str, inp_str)

In [31]:
reading_df['Post-College study habits'] = reading_df['Post-College study habits'].apply(lambda inp_str: format_postcollege_habit(inp_str))
reading_df['Post-College study habits'].unique()

array(['Soft copies', 'Watching Lecture Videos',
       'Reading books in library'], dtype=object)

***
**Covid pandemic impact - reading**

In [32]:
reading_df['Covid pandemic impact - reading'].unique()

array(['No - Not Affected', 'Yes - Positively Affected',
       'Yes- Negatively Affected', 'Yes, In some way',
       'No - Positively Affected', 'Yes - Negatively Affected'],
      dtype=object)

In [33]:
def format_pandemic_impact_r(inp_str):
    remapping = {
        'Yes- Negatively Affected': 'Yes - Negatively Affected',
        'Yes, In some way' : 'Yes - In some way'
    }
    return remapping.get(inp_str, inp_str)

In [34]:
reading_df['Covid pandemic impact - reading'] = reading_df['Covid pandemic impact - reading'].apply(lambda inp_str: format_pandemic_impact_r(inp_str))
reading_df['Covid pandemic impact - reading'].unique()

array(['No - Not Affected', 'Yes - Positively Affected',
       'Yes - Negatively Affected', 'Yes - In some way',
       'No - Positively Affected'], dtype=object)
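
The per-column helper functions above all follow the same pattern: look up a value in a dictionary and fall back to the original string. As a compact sketch of the same idea, `DataFrame.replace` accepts a nested dictionary of the form `{column: {old: new}}`, so several columns can be cleaned in one call. The toy frame below is made up; only the column names and mappings are taken from the cells above.

```python
import pandas as pd

# Hypothetical toy rows using two of the renamed columns from this notebook.
toy = pd.DataFrame({
    "Preffered reading material": ["New paper", "Magazines", "New Paper"],
    "Covid pandemic impact - reading": [
        "Yes- Negatively Affected", "No - Not Affected", "Yes, In some way",
    ],
})

# Outer keys are column labels; inner dictionaries map old values to new ones.
remappings = {
    "Preffered reading material": {
        "New paper": "Newspaper",
        "New Paper": "Newspaper",
    },
    "Covid pandemic impact - reading": {
        "Yes- Negatively Affected": "Yes - Negatively Affected",
        "Yes, In some way": "Yes - In some way",
    },
}

# Values without an entry in the inner dictionary are left unchanged.
cleaned = toy.replace(remappings)
print(cleaned)
```

Keeping all mappings in one structure makes it easier to review them together, at the cost of losing the one-cell-per-column walkthrough used here.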

## 3.4: Data Quality and Consistency
