# Exploratory Data Analysis (EDA)

### **All exercises will be using your chosen dataset**









## Step 1: Import your dataset using the tutorial from the slides

## Step 2: Install necessary libraries. We are using Pandas and NumPy for week 1

In [None]:
!pip install pandas numpy scikit-learn
import pandas as pd
import numpy as np

# Using scikit-learn's diabetes dataset for the solutions
from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes_sklearn = load_diabetes()

# Convert the dataset to a DataFrame
df = pd.DataFrame(data=diabetes_sklearn.data,
                  columns=diabetes_sklearn.feature_names)

# Add target variable to the DataFrame
df['target'] = diabetes_sklearn.target




## Exercises

### Pandas and NumPy Docs for Reference

#### [Pandas Documentation](https://pandas.pydata.org/docs/)

#### [NumPy Documentation](https://numpy.org/devdocs/)

## Pandas

### Exercise 1: Previewing The Dataset

In [None]:
# TODO: Read dataset into a Pandas dataframe

# df = pd.read_csv('name_of_dataset.csv')

In [None]:
# TODO: print the first 7 rows
df.head(7)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346,97.0
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062917,-0.038357,138.0


In [None]:
# TODO: print the last 5 rows
df.tail()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
437,0.041708,0.05068,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
438,-0.005515,0.05068,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,104.0
439,0.041708,0.05068,-0.015906,0.017293,-0.037344,-0.01384,-0.024993,-0.01108,-0.046883,0.015491,132.0
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.02656,0.044529,-0.02593,220.0
441,-0.045472,-0.044642,-0.07303,-0.081413,0.08374,0.027809,0.173816,-0.039493,-0.004222,0.003064,57.0


In [None]:
# TODO: print 5 random rows
df.sample(5)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
356,-0.005515,0.05068,-0.033151,-0.015999,0.008063,0.016222,0.015505,-0.002592,-0.028323,-0.075636,54.0
15,-0.052738,0.05068,-0.018062,0.080401,0.089244,0.107662,-0.039719,0.108111,0.03606,-0.042499,171.0
323,0.070769,0.05068,-0.007284,0.049415,0.060349,-0.004445,-0.054446,0.108111,0.129021,0.056912,248.0
175,0.067136,-0.044642,-0.03854,-0.026328,-0.03184,-0.026366,0.008142,-0.039493,-0.027129,0.003064,127.0
77,-0.096328,-0.044642,-0.036385,-0.074527,-0.03872,-0.027618,0.015505,-0.039493,-0.074093,-0.001078,200.0


Write what each row and what each column represents:

The columns represent attributes/features of the dataset and the rows represent data points/items.

### Exercise 2: Examining features/shape

In [None]:
# TODO: Print dataset shape
df.shape

(442, 11)

In [None]:
# TODO: Print the names of the columns
df.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6',
       'target'],
      dtype='object')

In [None]:
# TODO: Drop any identifier columns like names, id, etc...

# df.drop(columns=['Identifier'], inplace=True)

In [None]:
# TODO: Print all the data types of the columns
df.dtypes

Unnamed: 0,0
age,float64
sex,float64
bmi,float64
bp,float64
s1,float64
s2,float64
s3,float64
s4,float64
s5,float64
s6,float64


In [None]:
# TODO: Find the counts of each data type
df.dtypes.value_counts()

Unnamed: 0,count
float64,11


### Exercise 3: Missing/Duplicate Values

In [None]:
# TODO: Find the number of missing values per column
df.isna().sum()

Unnamed: 0,0
age,0
sex,0
bmi,0
bp,0
s1,0
s2,0
s3,0
s4,0
s5,0
s6,0


In [None]:
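# TODO: identify the column with the most missing values
# Note: idxmax() returns the first label when counts tie, so when no column
# has missing values it simply returns the first column.
df.isna().sum().idxmax()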

'age'

In [None]:
# TODO: Either drop the rows with missing values or impute them
df.dropna(inplace=True) # inplace modifies the dataframe

In [None]:
# TODO: Drop duplicate rows if there are any
df.drop_duplicates(inplace=True)
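
The diabetes dataset has no missing or duplicate rows, so the two cells above do not change anything. As a minimal sketch of the "impute" alternative, the example below uses a small hypothetical DataFrame (`toy`, with made-up `height` and `weight` columns) and fills gaps with each column's median before dropping duplicates. Median imputation is only one possible strategy and is an assumption here, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one exact duplicate row.
toy = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 3.0],
    "weight": [10.0, 20.0, 30.0, 30.0],
})

# Count exact duplicate rows before removing them.
n_duplicates = int(toy.duplicated().sum())

# Impute: fill each numeric column's gaps with that column's median.
imputed = toy.fillna(toy.median(numeric_only=True))

# Drop exact duplicates, returning a new DataFrame instead of using inplace.
cleaned = imputed.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(cleaned)
```

Returning new DataFrames instead of passing `inplace=True` keeps the original data available for comparison.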

### Exercise 4: Statistics

In [None]:
# TODO: Choose one numeric column and store its name in a variable
col = 'age'

In [None]:
# TODO: print a summary of statistics using .describe() for that column
df[col].describe()

Unnamed: 0,age
count,442.0
mean,-2.511817e-19
std,0.04761905
min,-0.1072256
25%,-0.03729927
50%,0.00538306
75%,0.03807591
max,0.1107267


In [None]:
# TODO: Compare the mean and median of the column
df[col].mean() - df[col].median()

np.float64(-0.005383060374248237)

What can the mean and median of this column tell you about the distribution of data?

The mean is slightly less than the median, which suggests that the distribution is mildly left-skewed: a few low values pull the mean down.
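
Comparing the mean and median is a rule of thumb, not a guarantee. As a rough cross-check, pandas also provides `Series.skew()`, which returns the sample skewness. The sketch below uses a hypothetical toy column (`values`) with one unusually small value to show both measures pointing the same way.

```python
import pandas as pd

# Hypothetical toy column with a long left tail (one very small value).
values = pd.Series([1.0, 8.0, 9.0, 9.0, 10.0, 10.0, 10.0])

# Negative when the mean sits below the median.
gap = values.mean() - values.median()

# Sample skewness; a negative value suggests a longer left tail.
skewness = values.skew()

print(gap, skewness)
```

On your own data, the same two lines can be applied to `df[col]`.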

### Exercise 5: Filtering/GroupBy

In [None]:
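# TODO: From any of the columns, look for suspicious values
# (like negative numbers)

# Note: the diabetes features are mean-centered and scaled, so negative values
# are expected here. This filter demonstrates boolean indexing by keeping only
# the rows with a positive age.
df[df['age'] > 0]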

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930,141.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064,63.0
8,0.041708,0.050680,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014960,0.011349,110.0
...,...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078,66.0
432,0.009016,-0.044642,0.055229,-0.005670,0.057597,0.044719,-0.002903,0.023239,0.055686,0.106617,173.0
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059,49.0
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
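
For a dataset where negative numbers really would be suspicious, the same boolean-indexing idea can flag them across all numeric columns at once. This is a sketch on hypothetical unscaled data (`raw`, with made-up values); what counts as "suspicious" depends on your dataset.

```python
import pandas as pd

# Hypothetical raw (unscaled) data where a negative value would be an error.
raw = pd.DataFrame({
    "age": [34, -2, 51, 27],
    "bmi": [22.1, 30.4, -1.0, 25.3],
})

# Boolean mask: True for rows with a negative value in any numeric column.
mask = (raw.select_dtypes("number") < 0).any(axis=1)

# Keep only the flagged rows for inspection.
suspicious = raw[mask]

print(suspicious)
```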


In [None]:
# TODO: From any categorical column, create and display a dataset that only
# contains one of the categories

# Create categorical variable from bmi
df['bmi_categories'] = pd.cut(df['bmi'], bins=[-1, 0, 0.1, 1], labels=['underweight', 'normal', 'overweight'])

df[df['bmi_categories'] == 'overweight']

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target,bmi_categories
32,0.034443,0.05068,0.125287,0.028758,-0.053855,-0.0129,-0.102307,0.108111,0.000272,0.027917,341.0,overweight
114,0.023546,-0.044642,0.110198,0.063187,0.013567,-0.032942,-0.024993,0.020655,0.099241,0.023775,258.0,overweight
138,0.034443,0.05068,0.111276,0.076958,-0.03184,-0.033881,-0.021311,-0.002592,0.02802,0.07348,336.0,overweight
145,-0.04184,-0.044642,0.128521,0.063187,-0.033216,-0.032629,0.011824,-0.039493,-0.015999,-0.050783,259.0,overweight
256,-0.049105,-0.044642,0.160855,-0.046985,-0.029088,-0.01979,-0.047082,0.034309,0.02802,0.011349,346.0,overweight
262,-0.016412,0.05068,0.127443,0.097615,0.016318,0.017475,-0.021311,0.034309,0.034866,0.003064,308.0,overweight
327,0.074401,-0.044642,0.114509,0.028758,0.024574,0.024991,0.019187,-0.002592,-0.000612,-0.00522,237.0,overweight
332,0.030811,-0.044642,0.104809,0.076958,-0.011201,-0.011335,-0.058127,0.034309,0.057108,0.036201,270.0,overweight
362,0.019913,0.05068,0.104809,0.070072,-0.035968,-0.026679,-0.024993,-0.002592,0.003709,0.040343,321.0,overweight
366,-0.045472,0.05068,0.137143,-0.015999,0.041086,0.03188,-0.043401,0.07121,0.071019,0.048628,233.0,overweight
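
`pd.cut` assigns each value to a bin, and by default each bin includes its right edge but not its left edge. The sketch below illustrates this on a hypothetical Series (`scores`) using the same bin edges as above, with placeholder labels.

```python
import pandas as pd

# Hypothetical values, including two that sit exactly on bin edges.
scores = pd.Series([-0.5, 0.0, 0.05, 0.1, 0.5])

# Bins are (-1, 0], (0, 0.1], (0.1, 1] because right=True is the default.
binned = pd.cut(scores, bins=[-1, 0, 0.1, 1], labels=["low", "mid", "high"])

print(list(binned))
```

A value of exactly `0.0` therefore lands in the first bin, and `0.1` lands in the second.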


In [None]:
# TODO: Using that same categorical column, group by that column and find how
# many rows belong to each of the groups

df.groupby('bmi_categories').count()


bmi_categories,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
underweight,247,247,247,247,247,247,247,247,247,247,247
normal,183,183,183,183,183,183,183,183,183,183,183
overweight,12,12,12,12,12,12,12,12,12,12,12


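Determine which group shows up the least and which shows up the most:

The `underweight` group shows up the most (247 rows) and the `overweight` group shows up the least (12 rows).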
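
`count()` repeats the same number for every column when nothing is missing. As a more compact sketch, `groupby(...).size()` returns one count per group, and `idxmax()` / `idxmin()` pick out the largest and smallest groups. The data below (`toy`, with made-up `group` and `value` columns) is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with three groups of different sizes.
toy = pd.DataFrame({
    "group": ["a", "b", "a", "c", "a", "b"],
    "value": [1, 2, 3, 4, 5, 6],
})

# One row count per group.
counts = toy.groupby("group").size()

most_common = counts.idxmax()
least_common = counts.idxmin()

print(counts)
print(most_common, least_common)
```

For a categorical column such as one produced by `pd.cut`, some pandas versions may emit a `FutureWarning` unless `observed` is passed explicitly to `groupby`.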

## NumPy

### Exercise 6: NumPy Operations



In [None]:
# TODO: Choose two numeric columns and put them into a list variable
cols = ['age', 'bmi']

In [None]:
# TODO: Convert those columns into a numpy matrix (ndarray)
mat = df[cols].to_numpy()
mat

array([[ 0.03807591,  0.06169621],
       [-0.00188202, -0.05147406],
       [ 0.08529891,  0.04445121],
       [-0.08906294, -0.01159501],
       [ 0.00538306, -0.03638469],
       [-0.09269548, -0.04069594],
       [-0.04547248, -0.04716281],
       [ 0.06350368, -0.00189471],
       [ 0.04170844,  0.06169621],
       [-0.07090025,  0.03906215],
       [-0.09632802, -0.08380842],
       [ 0.02717829,  0.01750591],
       [ 0.01628068, -0.02884001],
       [ 0.00538306, -0.00189471],
       [ 0.04534098, -0.02560657],
       [-0.05273755, -0.01806189],
       [-0.00551455,  0.04229559],
       [ 0.07076875,  0.01211685],
       [-0.0382074 , -0.0105172 ],
       [-0.02730979, -0.01806189],
       [-0.04910502, -0.05686312],
       [-0.0854304 , -0.02237314],
       [-0.0854304 , -0.00405033],
       [ 0.04534098,  0.06061839],
       [-0.06363517,  0.03582872],
       [-0.06726771, -0.01267283],
       [-0.10722563, -0.07734155],
       [-0.02367725,  0.05954058],
       [ 0.05260606,

In [None]:
# TODO: Print the shape of the data
mat.shape

(442, 2)

### Exercise 7: More Statistics

In [None]:
# TODO: Compute the mean of each column (axis = 0)
mat.mean(axis=0)

array([-1.44429466e-18, -2.25592546e-16])

In [None]:
# TODO: Compute the standard deviation of the columns
mat.std(axis=0)

array([0.04756515, 0.04756515])
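
The standard deviation reported by `describe()` in Exercise 4 (about 0.04762) differs slightly from the NumPy result here (about 0.04757). This is because pandas defaults to the sample standard deviation (`ddof=1`) while NumPy defaults to the population standard deviation (`ddof=0`). The sketch below shows the difference on a small hypothetical array (`values`).

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population std is exactly 2.0.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = values.std()          # NumPy default: ddof=0
sample_std = values.std(ddof=1)        # divides by n - 1 instead of n
pandas_std = pd.Series(values).std()   # pandas default: ddof=1

print(population_std, sample_std, pandas_std)
```

Passing `ddof=1` to `mat.std(axis=0)` should therefore reproduce the pandas figure.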

In [None]:
# TODO: Compute the mean of each row (axis = 1)
mat.mean(axis=1)

array([ 0.04988606, -0.02667804,  0.06487506, -0.05032898, -0.01550082,
       -0.06669571, -0.04631765,  0.03080448,  0.05170233, -0.01591905,
       -0.09006822,  0.0223421 , -0.00627967,  0.00174418,  0.00986721,
       -0.03539972,  0.01839052,  0.0414428 , -0.0243623 , -0.02268584,
       -0.05298407, -0.05390177, -0.04474037,  0.05297969, -0.01390323,
       -0.03997027, -0.09228359,  0.01793167,  0.01565537,  0.03046513,
       -0.00777571, -0.04458143,  0.07986524, -0.00979271, -0.02352466,
        0.00898894,  0.01777155,  0.00094597,  0.03475725,  0.00619523,
       -0.00148926, -0.0838009 , -0.03525992, -0.00176887,  0.05675203,
       -0.00406429, -0.03398255, -0.07559781,  0.01268123, -0.01378373,
        0.0135798 ,  0.03814962, -0.03108847, -0.01252668, -0.01202721,
       -0.04557919, -0.00031108, -0.04531989, -0.01134968,  0.01894855,
       -0.03747529, -0.01863389, -0.01729678, -0.03601868,  0.02076482,
       -0.03500062, -0.01360449,  0.01344   ,  0.00407904, -0.01

In [None]:
# TODO: Compute the std deviation of each row
mat.std(axis=1)

array([1.18101500e-02, 2.47960224e-02, 2.04238465e-02, 3.87339624e-02,
       2.08838763e-02, 2.59997687e-02, 8.45167502e-04, 3.26991907e-02,
       9.99388082e-03, 5.49812000e-02, 6.25979640e-03, 4.83618980e-03,
       2.25603417e-02, 3.63888311e-03, 3.54737774e-02, 1.73378339e-02,
       2.39050721e-02, 2.93259507e-02, 1.38450993e-02, 4.62394937e-03,
       3.87905261e-03, 3.15286328e-02, 4.06900355e-02, 7.63870555e-03,
       4.97319435e-02, 2.72974410e-02, 1.49420403e-02, 4.16089148e-02,
       3.69506917e-02, 3.66710841e-02, 5.22269225e-02, 2.09041855e-02,
       4.54218754e-02, 4.06035393e-02, 3.98053349e-02, 3.99845768e-02,
       5.12341729e-03, 1.00930662e-02, 3.66392659e-02, 8.07724590e-03,
       6.87231933e-03, 1.61596562e-02, 2.47427147e-02, 2.16820807e-02,
       1.14110478e-02, 3.12425856e-02, 2.23875394e-02, 2.56751064e-03,
       5.44549833e-02, 2.80562074e-02, 2.08635671e-02, 2.17215189e-02,
       2.16490822e-02, 3.37958469e-03, 3.70778062e-02, 3.73924880e-03,
      

### Exercise 8: Math Operations

In [None]:
# TODO: Calculate the dot product of the two original columns (you can think of these columns as vectors)
col_1 = mat[:, 0]
col_2 = mat[:, 1]

dot_product = np.dot(col_1, col_2)

In [None]:
# TODO: Calculate the magnitude of both columns
mag1 = np.linalg.norm(col_1)
mag2 = np.linalg.norm(col_2)

In [None]:
# TODO: Find the cosine similarity (dot product / product of magnitudes)
dot_product / (mag1 * mag2)

np.float64(0.18508466614655555)

Cosine similarity measures how closely two feature vectors point in the same direction, and it ranges from -1 to 1. Because the columns here are mean-centered, it coincides with the Pearson correlation coefficient, so a value near 1 or -1 indicates that the features are highly correlated: changes in one variable correspond to changes in the other. This is called **collinearity**. The value of about 0.19 above indicates only a weak relationship between `age` and `bmi`.

Using two or more highly correlated features in a machine learning algorithm can lead to problems such as redundant information and reduced model reliability.
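
Cosine similarity only matches the Pearson correlation when both vectors have been mean-centered first, which happens to be true for the diabetes features. As a minimal sketch on hypothetical uncentered vectors (`x` and `y`), the example below compares the raw cosine similarity, the cosine similarity after centering, and `np.corrcoef`. The helper `cosine_similarity` is defined here for illustration and is not a NumPy function.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical, uncentered feature vectors.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 9.0, 12.0, 15.0])

raw_cosine = cosine_similarity(x, y)
centered_cosine = cosine_similarity(x - x.mean(), y - y.mean())
pearson = float(np.corrcoef(x, y)[0, 1])

print(raw_cosine, centered_cosine, pearson)
```

If your own columns are not centered, `df[cols].corr()` is usually the more direct way to check for collinearity.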

### Exercise 9: Further Analysis

#### Continue exploring and gaining more insights into your data. Utilize the docs linked above.