# Exercise Week 2 - exploratory data analysis

Exploring a quantitative variable and its interaction with multiple variables, both quantitative and qualitative. Make a Jupyter Notebook, put it on GitHub and make a link to it.

This exercise covers: creating dataframes, merging dataframes, creating a distplot, excluding specific data from the dataframe, calculating a mean and median, and creating a boxplot.

## Import both dataframes and merge the dataframes

Each dataset is loaded into its own pandas dataframe. The two dataframes are then merged into a new dataframe so that all the data sits in a single dataset.

In [19]:
import seaborn as sns #this is the plotting library I'll be using 
import pandas as pd #"as pd" means that we can use the abbreviation in commands
import matplotlib.pyplot as plt #we need Matplotlib for setting the labels in the Seaborn graphs

df1 = pd.read_csv('steps.csv', sep=';', engine='python') # sep=';' because the file uses the European CSV notation
df2 = pd.read_csv('survey.csv')

df = pd.merge(df1, df2, on = "id") #'id' is the common identifier

# The handy Pandas function to_numeric converts values to numbers. With errors='coerce', entries that
# cannot be parsed become NaN instead of raising an error.
# The apply method of a Series lets us apply a function to all of its elements.
df['weight'] = df['weight'].apply(pd.to_numeric, errors='coerce')

#To remove the values above 170 we need to create a function that we then apply to the entire column
def above_170(x):
    if(x > 170): 
        return float('NaN')
    else: 
        return x
df['weight'] = df['weight'].apply(above_170)

#To remove the values below 40 we need to create a function that we then apply to the entire column
def below_40(x):
    if(x < 40): 
        return float('NaN')
    else: 
        return x
df['weight'] = df['weight'].apply(below_40)

df.head() # this shows the head of the dataframe in the output

Unnamed: 0,id,20-6-2013,21-6-2013,22-6-2013,23-6-2013,24-6-2013,25-6-2013,26-6-2013,27-6-2013,28-6-2013,...,12-5-2014,13-5-2014,city,gender,age,hh_size,education,education_1,height,weight
0,1,,,,,3941.0,15733.0,9929.0,12879.0,10541.0,...,,,Bordeaux,Male,25-34,4,4,Master or doctorate,178.0,98.0
1,2,,,10473.0,705.0,4287.0,5507.0,4024.0,3926.0,14595.0,...,,,Lille,Male,35-44,1,3,Bachelor,180.0,77.0
2,3,,11428.0,12523.0,2553.0,190.0,2164.0,8185.0,9630.0,8983.0,...,1129.0,,Montpellier,Male,25-34,2,2,Master or doctorate,180.0,83.0
3,4,,,,,,,,,,...,,,Lyon,Male,<25,1,1,Bachelor,178.0,80.0
4,5,,,,,,,,,,...,,,Montpellier,Female,25-34,3,4,Bachelor,167.0,61.0
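
The cleaning steps in the cell above can also be written without per-element functions. The following is a minimal sketch on toy data (the values are made up for illustration, not taken from `survey.csv`), assuming the same 40-170 kg bounds; it is an alternative formulation, not the code the notebook ran.

```python
import pandas as pd

# Toy data (hypothetical values); the real column comes from survey.csv.
weights = pd.Series(["98", "77", "700", "35", "n/a", "170", "40"], name="weight")

# Non-numeric entries such as "n/a" become NaN instead of raising an error.
numeric = pd.to_numeric(weights, errors="coerce")

# Keep values within 40-170 kg (both bounds included); everything else becomes NaN.
cleaned = numeric.where(numeric.between(40, 170))

print(cleaned.tolist())
```

Because `between` includes both bounds, values of exactly 40 and 170 are kept, which matches the `x < 40` and `x > 170` checks in the original functions.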


In [20]:
df.info() #Get information on the variables

<class 'pandas.core.frame.DataFrame'>
Int64Index: 929 entries, 0 to 928
Columns: 337 entries, id to weight
dtypes: float64(330), int64(3), object(4)
memory usage: 2.4+ MB
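
`pd.merge` performs an inner join by default, so participants whose `id` is missing from either file are dropped silently. A small sketch of how this could be checked is shown below; the two toy tables are hypothetical stand-ins for `steps.csv` and `survey.csv`.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for steps.csv and survey.csv.
steps = pd.DataFrame({"id": [1, 2, 3], "20-6-2013": [3941.0, None, 11428.0]})
survey = pd.DataFrame({"id": [2, 3, 4], "city": ["Lille", "Montpellier", "Lyon"]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each id was found in the left table, the right table, or both.
# validate="one_to_one" raises an error if an id is duplicated in either table.
check = pd.merge(steps, survey, on="id", how="outer",
                 indicator=True, validate="one_to_one")
unmatched = check.loc[check["_merge"] != "both", ["id", "_merge"]]
print(unmatched)

# The default inner merge keeps only ids present in both tables.
merged = pd.merge(steps, survey, on="id")
print(len(merged))
```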


## Create the first plot

This plot is used to identify outliers in the weight data. Some entries around 700 are visible in the plot; these are most likely faulty data.
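
The notebook does not include the plotting cell itself, so the following is only a sketch of how such a plot could be produced. It assumes seaborn 0.11 or later, where `histplot` replaces the deprecated `distplot`; the helper names `load_raw_weight` and `plot_weight_distribution` and the toy values are hypothetical.

```python
import pandas as pd


def load_raw_weight(survey: pd.DataFrame) -> pd.Series:
    """Return the weight column as numbers, before any outlier removal."""
    return pd.to_numeric(survey["weight"], errors="coerce").dropna()


def plot_weight_distribution(weight: pd.Series) -> None:
    """Draw a histogram of the weights (plotting libraries are imported here)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.histplot(weight, bins=50)
    plt.xlabel("weight (kg)")
    plt.ylabel("count")
    plt.show()


# Toy data (hypothetical values) containing one implausible entry at 700.
toy = pd.DataFrame({"weight": ["98", "77", "700", "35", "unknown", "61"]})
raw_weight = load_raw_weight(toy)
print(raw_weight.max())

# plot_weight_distribution(raw_weight)  # uncomment to draw the figure
```

Plotting the weights before the cleaning step matters here: once values above 170 are replaced with NaN, the entries near 700 no longer appear in the figure.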

## Create a Subset dataframe 

In [21]:
# Select the numeric columns used in the correlation matrix and pairplot below.
# "age" is stored as a category string (e.g. "25-34"), so it is left out here.
df_subset = df[["height", "weight"]]

df_subset.head()

## Calculate statistics

In [27]:
# calculate some statistics for height
# Selecting the column as a Series (df['height']) makes each statistic a plain number.
print('median: ' + str(df['height'].median()))
print('mode: ' + str(df['height'].mode()[0]))
print('mean: ' + str(df['height'].mean()))
print('standard deviation: ' + str(df['height'].std()))
print('variance: ' + str(df['height'].var()))

median: 172.0
mode: 170.0
mean: 171.66810344827587
standard deviation: 9.080235412579503
variance: 82.45067514786287
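
The standard deviation and variance printed above are sample statistics: pandas divides by n - 1 (`ddof=1`) by default, and the variance is the square of the standard deviation. A short sketch using the five heights shown in the `df.head()` output illustrates the difference from the population version.

```python
import pandas as pd

# The five heights shown in the df.head() output above.
height = pd.Series([178.0, 180.0, 180.0, 178.0, 167.0])

sample_std = height.std()            # divides by n - 1 (ddof=1, the pandas default)
population_std = height.std(ddof=0)  # divides by n
sample_var = height.var()            # also uses ddof=1 by default

print(sample_std, population_std, sample_var)
```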


In [28]:
# Row-wise mean over all the date columns: each participant's mean daily steps, stored as a dataframe
mean_steps = df.loc[:,"20-6-2013":"13-5-2014"].mean(axis = 1).to_frame()
mean_steps.index #the index still holds the row labels (one per participant), not the dates

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            919, 920, 921, 922, 923, 924, 925, 926, 927, 928],
           dtype='int64', length=929)
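
To make the behaviour of `mean(axis = 1)` concrete, here is a small sketch on a toy frame (hypothetical values, three date columns instead of the full range). It shows that the label slice includes both end columns, that missing days are skipped, and that the result is indexed by row.

```python
import pandas as pd

# Toy frame (hypothetical values): two participants, three date columns, one survey column.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "20-6-2013": [None, 4000.0],
        "21-6-2013": [10000.0, 6000.0],
        "22-6-2013": [12000.0, None],
        "city": ["Bordeaux", "Lille"],
    }
)

# .loc slices by label and includes both end points, so all three date columns are selected.
date_columns = toy.loc[:, "20-6-2013":"22-6-2013"]

# axis=1 averages across the columns of each row; NaN values are skipped by default.
per_person = date_columns.mean(axis=1)
print(per_person)
print(per_person.index.tolist())
```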

In [29]:
mean_steps

Unnamed: 0,0
0,10205.521212
1,5687.423313
2,8301.729730
3,3633.200000
4,5312.129630
...,...
924,6282.131868
925,4799.880000
926,10030.326829
927,15679.679012


In [31]:
mean_steps.rename(columns={0:'mean_steps'}, inplace = True)
mean_steps


Unnamed: 0,mean_steps
0,10205.521212
1,5687.423313
2,8301.729730
3,3633.200000
4,5312.129630
...,...
924,6282.131868
925,4799.880000
926,10030.326829
927,15679.679012
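
Since `mean_steps` shares its row index with `df`, it can be attached to the survey columns for later analysis. The sketch below uses a toy frame with hypothetical values; naming the Series with `rename` is an alternative to the separate `to_frame` and `rename(columns=...)` steps, not what the notebook ran.

```python
import pandas as pd

# Toy frame (hypothetical values) with two date columns and one survey column.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "20-6-2013": [8000.0, 4000.0],
        "21-6-2013": [10000.0, 6000.0],
        "weight": [98.0, 77.0],
    }
)

# Naming the Series up front makes the separate column rename unnecessary.
mean_steps = toy.loc[:, "20-6-2013":"21-6-2013"].mean(axis=1).rename("mean_steps")

# Both objects share the same row index, so join lines the values up per participant.
subset = toy[["id", "weight"]].join(mean_steps)
print(subset)
```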


### Correlation Matrix

In [None]:
corr = df_subset.corr() 
corr

In [None]:
sns.pairplot(df_subset)
plt.show()
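
The two cells above were not executed in this notebook, so no output is available. As an illustration only, the sketch below computes a correlation matrix for the five rows shown in the `df.head()` output; with so few rows the numbers say nothing about the full dataset.

```python
import pandas as pd

# The age, height and weight values of the five rows shown in df.head().
toy = pd.DataFrame(
    {
        "age": ["25-34", "35-44", "25-34", "<25", "25-34"],
        "height": [178.0, 180.0, 180.0, 178.0, 167.0],
        "weight": [98.0, 77.0, 83.0, 80.0, 61.0],
    }
)

# Select the numeric columns explicitly: category strings such as "25-34"
# cannot be used in a Pearson correlation.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the off-diagonal entries carry information.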

## Summary of the statistics 
The "mean" is the familiar average: the sum of all values divided by the number of values. For the weight in this dataframe, the mean can be seen in Fig 5 and is approximately 72.33 kg.

The "median", by contrast, is the middle value of the data once the values are sorted from smallest to largest; with an even number of values it is the average of the two middle values. The median weight in this dataframe can also be seen in Fig 5 and is 71.0 kg.
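
The difference between the two measures shows most clearly when an outlier is present. The sketch below uses the five weights from the `df.head()` output plus one added entry of 700 (an assumed faulty value, for illustration) and Python's standard `statistics` module.

```python
from statistics import mean, median

# The five weights shown in df.head(), sorted from smallest to largest.
weights = [61.0, 77.0, 80.0, 83.0, 98.0]
print(mean(weights), median(weights))  # the median is the middle value, 80.0

# Adding one faulty entry of 700 (assumed, for illustration).
with_outlier = weights + [700.0]
print(mean(with_outlier), median(with_outlier))  # the median is (80 + 83) / 2 = 81.5
```

The single extreme value more than doubles the mean, while the median moves by only 1.5 kg.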

## Discussion
Verbal description of the distribution, including an investigation into its normality, skewness, outliers, etc.

In the first figure (Fig 1), some outliers can be seen around 700 and below 40. The healthy weight (BMI) chart from the NHS (2020) has 40 kg as its lowest included weight and 170 kg as its highest. The values outside this range are therefore most likely the result of faulty data entry.
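
The skewness part of the investigation could be approached as in the sketch below. The weights are toy values chosen for illustration (one entry at 700 and one below 40), and the 40-170 kg bounds are the ones taken from the NHS chart.

```python
import pandas as pd

# Toy weights (hypothetical values), including one entry at 700 and one below 40.
weight = pd.Series([61.0, 77.0, 80.0, 83.0, 98.0, 72.0, 700.0, 35.0])

# Flag entries outside the 40-170 kg range.
outlier_mask = ~weight.between(40, 170)
print(weight[outlier_mask].tolist())

# Skewness near 0 suggests a roughly symmetric distribution;
# a large positive value indicates a long right tail.
print(weight.skew())
print(weight[~outlier_mask].skew())
```

Comparing the skewness before and after excluding the flagged entries gives a rough indication of how much of the asymmetry is caused by the outliers alone.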

NHS (2020, April 7). Height and weight chart. Nhs.Uk. https://www.nhs.uk/live-well/healthy-weight/height-weight-chart/