# Moses Permaul - IS362 - Project 2 - Wide Data 1


This dataset contains information about students and their results on two tests across three terms. Some columns combine more than one variable (for example, sex and age share a single column). Other columns, such as name and phone, hold values that identify a student rather than quantities we would calculate with, so their data types need attention before analysis.

After cleaning this data, we will find the following:
1. Test Score Averages
2. Test Score Sums

### Python Code for Imports and Reading the Data
To begin, we will import the standard libraries needed, read in the data, and display the first 5 rows.

In [1]:
# standard imports for numpy and pandas
import numpy as np
import pandas as pd

# read wide data to DataFrame
raw_data = pd.read_csv('data/wide_data_1.csv')

# make a copy of DataFrame to preserve original import
wide_data = raw_data.copy()

# view first 5 rows to understand the data
wide_data.head()

,id,name,phone,sex and age,test number,term 1,term 2,term 3
0,1,Mike,134,m_12,test 1,76,84,87
1,2,Linda,270,f_13,test 1,88,90,73
2,3,Sam,210,m_11,test 1,78,74,80
3,4,Esther,617,f_12,test 1,68,75,74
4,5,Mary,114,f_14,test 1,65,67,64


### Cleaning the Data
Before cleaning the data, we will first look at the data types within the "wide_data" DataFrame.

In [2]:
# view data types for the DataFrame
wide_data.dtypes

id              int64
name           object
phone           int64
sex and age    object
test number    object
term 1          int64
term 2          int64
term 3          int64
dtype: object

The first thing we notice is that the column "**sex and age**" combines 2 different pieces of information: the person's sex, followed by an underscore, and then their age.

To clean this, we will first **separate this single column into 2 different columns** and **add them to the DataFrame**. After the split, the age will be a string instead of an int, so we will also **convert** it. Once the column has been split, we will **remove the "sex and age" column**, as we no longer need it.

In [3]:
# split column and add to data
wide_data[['sex', 'age']] = wide_data['sex and age'].str.split('_', expand=True)

# use to_numeric to convert age column to int
wide_data['age'] = pd.to_numeric(wide_data['age'])

# delete 'sex and age' column
del wide_data['sex and age']

# display the dataframe
wide_data

,id,name,phone,test number,term 1,term 2,term 3,sex,age
0,1,Mike,134,test 1,76,84,87,m,12
1,2,Linda,270,test 1,88,90,73,f,13
2,3,Sam,210,test 1,78,74,80,m,11
3,4,Esther,617,test 1,68,75,74,f,12
4,5,Mary,114,test 1,65,67,64,f,14
5,1,Mike,134,test 2,85,80,90,m,12
6,2,Linda,270,test 2,87,82,94,f,13
7,3,Sam,210,test 2,80,87,80,m,11
8,4,Esther,617,test 2,70,75,78,f,12
9,5,Mary,114,test 2,68,70,63,f,14
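
The split above relies on every value following the `sex_age` pattern. As a hedged alternative, the sketch below uses `Series.str.extract` with a regular expression, which fills non-matching rows with NaN so they can be caught before conversion. The `sample` DataFrame and the pattern (which assumes sex is always `m` or `f`) are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the 'sex and age' values shown above
sample = pd.DataFrame({'sex and age': ['m_12', 'f_13', 'm_11', 'f_12', 'f_14']})

# Named groups become the output column names; rows that do not match
# the pattern are filled with NaN instead of raising an error
parts = sample['sex and age'].str.extract(r'^(?P<sex>[mf])_(?P<age>\d+)$')

# Stop early if any row failed to match, then convert age to an integer
assert parts.notna().all().all(), 'unexpected value in "sex and age"'
parts['age'] = parts['age'].astype(int)

print(parts)
```

For a small, well-formed file like this one, `str.split` is simpler; the regular expression mainly helps when the input may contain malformed entries.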


Next, we will look at the "**phone**" column. This **data is currently an int**, but it shouldn't be, since we normally don't perform mathematical operations on phone numbers. We will convert it to a string.

In [4]:
# convert phone from int to str, since it is an identifier rather than a quantity
wide_data['phone'] = wide_data['phone'].astype(str)
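
Converting after the import works for this data, but an integer column has already lost any leading zeros by that point. A minimal sketch of one way to avoid that, assuming the file can be re-read, is to pass `dtype` to `pd.read_csv`. The CSV text below, including the name `Alex` and the phone value `012`, is invented purely to show the effect.

```python
import io
import pandas as pd

# Hypothetical CSV text; the phone value '012' is invented to show the issue
csv_text = "id,name,phone\n1,Mike,134\n2,Alex,012\n"

# Default parsing infers an integer column, so '012' becomes 12
as_int = pd.read_csv(io.StringIO(csv_text))

# Declaring the dtype at read time keeps the original characters
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'phone': str})

print(as_int['phone'].tolist())
print(as_str['phone'].tolist())
```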

Looking at the data types again, we should now have clean data to work with.

In [5]:
# view data types for the DataFrame
wide_data.dtypes

id              int64
name           object
phone          object
test number    object
term 1          int64
term 2          int64
term 3          int64
sex            object
age             int64
dtype: object

### Analyzing the Data
1. The first calculation we're going to make is the average score of each test for each term.

In [6]:
# average test score for each term
wide_data[['test number','term 1', 'term 2', 'term 3']].groupby('test number').mean()

test number,term 1,term 2,term 3
test 1,75.0,78.0,75.6
test 2,78.0,78.8,81.0


    a) The average score for test 1 was the highest during term 2.
    b) The average score for test 2 was the highest during term 3.
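
The two observations above were read off the table by eye. As a sketch, the same result can be obtained with `DataFrame.idxmax`; the `averages` DataFrame below is rebuilt by hand from the output above rather than taken from the notebook's state.

```python
import pandas as pd

# Test averages per term, copied from the output above
averages = pd.DataFrame(
    {'term 1': [75.0, 78.0], 'term 2': [78.0, 78.8], 'term 3': [75.6, 81.0]},
    index=pd.Index(['test 1', 'test 2'], name='test number'),
)

# idxmax(axis=1) returns, for each row, the column label holding the maximum
best_term = averages.idxmax(axis=1)
print(best_term)
```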

2. Another useful average that we will find is the average test score per person for each term.

In [7]:
# average test score per student for each term
wide_data[['name', 'term 1', 'term 2', 'term 3']].groupby('name').mean()

name,term 1,term 2,term 3
Esther,69.0,75.0,76.0
Linda,87.5,86.0,83.5
Mary,66.5,68.5,63.5
Mike,80.5,82.0,88.5
Sam,79.0,80.5,80.0


    Looking at the above, we can see that Mary had the lowest average of the group in every term.
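
The sketch below checks that reading programmatically with `idxmin` and then ranks students by their mean across all three terms. The `per_student` DataFrame is copied by hand from the output above, and the overall ranking is an added illustration rather than a step from the original analysis.

```python
import pandas as pd

# Per-student term averages, copied from the output above
per_student = pd.DataFrame(
    {
        'term 1': [69.0, 87.5, 66.5, 80.5, 79.0],
        'term 2': [75.0, 86.0, 68.5, 82.0, 80.5],
        'term 3': [76.0, 83.5, 63.5, 88.5, 80.0],
    },
    index=pd.Index(['Esther', 'Linda', 'Mary', 'Mike', 'Sam'], name='name'),
)

# idxmin() gives the student with the lowest average in each term
lowest_per_term = per_student.idxmin()

# Averaging across the three terms ranks students overall, lowest first
overall = per_student.mean(axis=1).sort_values()

print(lowest_per_term)
print(overall)
```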

3. Lastly, we'll find the sum of the scores for each test within each term.

In [8]:
# total test score for each term
wide_data[['test number', 'term 1', 'term 2', 'term 3']].groupby(['test number']).sum()

test number,term 1,term 2,term 3
test 1,375,390,378
test 2,390,394,405


We can see that **test 1 had the highest accumulated score during term 2**, and **test 2 had the highest accumulated score during term 3**.
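
Because the averages and sums are grouped by the same column, both can be produced in a single pass. The following is a minimal sketch using `GroupBy.agg` with a list of function names; the `scores` DataFrame is rebuilt from the ten rows displayed earlier so that the example runs without the CSV file.

```python
import pandas as pd

# The ten rows shown earlier, reduced to the columns needed here
scores = pd.DataFrame({
    'test number': ['test 1'] * 5 + ['test 2'] * 5,
    'term 1': [76, 88, 78, 68, 65, 85, 87, 80, 70, 68],
    'term 2': [84, 90, 74, 75, 67, 80, 82, 87, 75, 70],
    'term 3': [87, 73, 80, 74, 64, 90, 94, 80, 78, 63],
})

# agg() with a list of function names yields one sub-column per statistic
summary = scores.groupby('test number').agg(['mean', 'sum'])
print(summary)
```

The result has two-level column labels, so a single value is selected with a tuple, for example `summary.loc['test 1', ('term 2', 'sum')]`.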