# Day 10: Data Processing

In [1]:
import pandas as pd
import numpy as np
from numpy import linalg
import sklearn as sk
from sklearn import preprocessing
import statsmodels.formula.api as smf
import math
from sklearn.model_selection import train_test_split

In [2]:
#This is your first pip!
!pip install factor_analyzer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
df = pd.read_csv('/content/sample_data/california_housing_test.csv')
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.3,34.26,43.0,1510.0,310.0,809.0,277.0,3.599,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0


## Standardizing

It helps to do these steps by hand first in order to understand what the library functions do. The calculations are easy to run manually anyway; wrapping them in a function is a matter of good coding practice, not a sign that the process is too complex to remember.

In [4]:
mu = df['longitude'].mean()
sd = df['longitude'].std()

df['standardized_long'] =  (df['longitude'] - mu)/sd

df[['longitude','standardized_long']]

Unnamed: 0,longitude,standardized_long
0,-122.05,-1.233523
1,-118.30,0.646236
2,-117.81,0.891858
3,-118.36,0.616160
4,-119.67,-0.040503
...,...,...
2995,-119.86,-0.135744
2996,-118.14,0.726439
2997,-119.70,-0.055541
2998,-117.12,1.237734
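
The manual calculation above can be wrapped in a small helper, which is the "functionalization" this section mentions. The following is a minimal sketch, not the notebook's own code: the name `standardize` and the `toy` frame (the first five longitudes from the table above) are assumptions made for illustration. The z-scores differ from the notebook's because the mean and standard deviation here come from five rows rather than the full column.

```python
import pandas as pd

def standardize(series: pd.Series) -> pd.Series:
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    # pandas' .std() defaults to ddof=1 (sample standard deviation),
    # which matches the manual calculation in the cell above.
    return (series - series.mean()) / series.std()

# Hypothetical toy frame: the first five longitudes shown earlier
toy = pd.DataFrame({"longitude": [-122.05, -118.30, -117.81, -118.36, -119.67]})
toy["standardized_long"] = standardize(toy["longitude"])

print(toy)
# After standardizing, the column has mean ~0 and standard deviation ~1
print(toy["standardized_long"].mean(), toy["standardized_long"].std())
```

`sklearn.preprocessing.StandardScaler` performs the same centering and scaling, but it divides by the population standard deviation (ddof=0), so its results differ slightly from the pandas calculation on small samples.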


## Normalizing

In [5]:
minL = df['longitude'].min()
maxL = df['longitude'].max()

df['normalized_long'] = ( df['longitude'] - minL ) / ( maxL-minL )

df[['longitude','normalized_long', 'standardized_long']]

Unnamed: 0,longitude,normalized_long,standardized_long
0,-122.05,0.219814,-1.233523
1,-118.30,0.606811,0.646236
2,-117.81,0.657379,0.891858
3,-118.36,0.600619,0.616160
4,-119.67,0.465428,-0.040503
...,...,...,...
2995,-119.86,0.445820,-0.135744
2996,-118.14,0.623323,0.726439
2997,-119.70,0.462332,-0.055541
2998,-117.12,0.728586,1.237734
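
The same min-max calculation can be written as a helper or delegated to scikit-learn. This is a sketch under assumptions: `normalize` and `toy` are hypothetical names, and the data is the first five longitudes from the table above, so the values will not match the full-column output.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy frame: the first five longitudes shown earlier
toy = pd.DataFrame({"longitude": [-122.05, -118.30, -117.81, -118.36, -119.67]})

def normalize(series: pd.Series) -> pd.Series:
    """Rescale a column to the [0, 1] range (min-max normalization)."""
    # Note: a constant column makes the denominator zero.
    return (series - series.min()) / (series.max() - series.min())

toy["normalized_long"] = normalize(toy["longitude"])

# MinMaxScaler expects a 2-D input, hence the double brackets;
# ravel() flattens the (n, 1) result back to one dimension.
scaler = preprocessing.MinMaxScaler()
toy["normalized_sklearn"] = scaler.fit_transform(toy[["longitude"]]).ravel()

print(toy)
```

The two columns should agree up to floating-point rounding. The scaler object has the advantage of remembering the minimum and maximum it was fitted on, so the same rescaling can be applied to other data with `scaler.transform`.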


# Dimensionality Reduction

One might recall multicollinearity as a major problem, one that becomes more common as data sets become wider, since each added column raises the chance that some column closely tracks another. Lengthening a data set (adding rows rather than columns) is sometimes pitched as a way to reduce multicollinearity, but this only reduces **incidental** multicollinearity, where two unrelated variables *just so happen to move together*. In practice, multicollinearity most often arises because two variables are functionally redundant, for example the number of books sold by an author and the number of pages sold by that author.


One way to avoid multicollinearity is to reduce the number of columns; dimensionality reduction provides some guidance on how to do it. The basic process relies on the correlation between columns: columns that are tightly related have high correlations (and would therefore show a high R^2 when one is regressed on the others in a VIF test).

In [6]:
df.corr()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,standardized_long,normalized_long
longitude,1.0,-0.925017,-0.064203,0.049865,0.070869,0.111572,0.051062,-0.018701,-0.050662,1.0,1.0
latitude,-0.925017,1.0,-0.025143,-0.039632,-0.068245,-0.117318,-0.068296,-0.072363,-0.138428,-0.925017,-0.925017
housing_median_age,-0.064203,-0.025143,1.0,-0.36785,-0.323154,-0.299888,-0.305171,-0.144315,0.091409,-0.064203,-0.064203
total_rooms,0.049865,-0.039632,-0.36785,1.0,0.937749,0.838867,0.914116,0.221249,0.160427,0.049865,0.049865
total_bedrooms,0.070869,-0.068245,-0.323154,0.937749,1.0,0.856387,0.970758,0.024025,0.082279,0.070869,0.070869
population,0.111572,-0.117318,-0.299888,0.838867,0.856387,1.0,0.89553,0.032361,-0.001192,0.111572,0.111572
households,0.051062,-0.068296,-0.305171,0.914116,0.970758,0.89553,1.0,0.048625,0.100176,0.051062,0.051062
median_income,-0.018701,-0.072363,-0.144315,0.221249,0.024025,0.032361,0.048625,1.0,0.672695,-0.018701,-0.018701
median_house_value,-0.050662,-0.138428,0.091409,0.160427,0.082279,-0.001192,0.100176,0.672695,1.0,-0.050662,-0.050662
standardized_long,1.0,-0.925017,-0.064203,0.049865,0.070869,0.111572,0.051062,-0.018701,-0.050662,1.0,1.0


In [7]:
#Let's focus on the problematic areas
df.corr()[['longitude','normalized_long', 'standardized_long']]

Unnamed: 0,longitude,normalized_long,standardized_long
longitude,1.0,1.0,1.0
latitude,-0.925017,-0.925017,-0.925017
housing_median_age,-0.064203,-0.064203,-0.064203
total_rooms,0.049865,0.049865,0.049865
total_bedrooms,0.070869,0.070869,0.070869
population,0.111572,0.111572,0.111572
households,0.051062,0.051062,0.051062
median_income,-0.018701,-0.018701,-0.018701
median_house_value,-0.050662,-0.050662,-0.050662
standardized_long,1.0,1.0,1.0


In [8]:
#Let's focus on the problematic areas. Note the double subsetting: columns by name, then rows by position (iloc counts from 0).
df.corr()[['longitude', 'latitude','normalized_long', 'standardized_long']].iloc[[0,1,10,9],:]

Unnamed: 0,longitude,latitude,normalized_long,standardized_long
longitude,1.0,-0.925017,1.0,1.0
latitude,-0.925017,1.0,-0.925017,-0.925017
normalized_long,1.0,-0.925017,1.0,1.0
standardized_long,1.0,-0.925017,1.0,1.0


These are the problematic ones. Longitude and its rescaled variants are perfectly correlated (1.0), because standardizing and normalizing are linear transformations of the original column. Only one should be kept for further use; the choice should be based on ease of interpretation and suitability to purpose, and made before viewing final results.
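
Counting positions for `iloc` gets error-prone as the frame grows, so one option is to select by label and to scan for highly correlated pairs automatically. The sketch below uses synthetic data; the column names, the `toy` frame, and the 0.9 cutoff are assumptions chosen for illustration, not values from the notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_rescaled": 3 * x + 2,        # linear transform of x, so correlation is 1
    "noise": rng.normal(size=200),  # unrelated column
})

corr = toy.corr()

# Label-based selection: the same names pick both rows and columns
cols = ["x", "x_rescaled"]
print(corr.loc[cols, cols])

# Visit each pair of columns once and keep those above the cutoff
threshold = 0.9  # hypothetical cutoff
high = [
    (a, b, corr.loc[a, b])
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]
print(high)
```

Each flagged pair is a candidate for dropping one of its two columns; which one to drop remains a judgment call.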

Amusingly, longitude and latitude are also strongly correlated (-0.925), since the state of CA is shaped like a diagonal downward slash. It is certainly possible to cut latitude on the basis of being too correlated with longitude, but doing so is almost certainly overzealous: latitude carries some information beyond longitude.
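
The VIF test mentioned earlier in this section regresses each column on all the others and reports 1 / (1 - R^2), so a column that the others predict well gets a large value. The following is a minimal sketch using plain NumPy on synthetic data; the function name `vif` and the columns `a`, `b`, `c` are hypothetical, and this is not necessarily how any particular library computes it.

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF for each column: 1 / (1 - R^2) from regressing it on the others."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit
        ss_res = ((y - X @ beta) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out[col] = np.inf if r2 >= 1 else 1 / (1 - r2)
    return pd.Series(out, name="VIF")

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),  # nearly redundant with a
    "c": rng.normal(size=500),            # independent of a and b
})

vifs = vif(toy)
print(vifs)
```

Here `a` and `b` should show large values while `c` stays near 1. A common rule of thumb treats a VIF above roughly 5 or 10 as a warning sign, though the cutoff is a convention rather than a hard rule. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.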

# Categorical Variables

Some variables are not numeric; see below. One might want to encode these numerically so that a model can use them.

In [9]:
import seaborn as sns

# loads the iris dataset
iris = sns.load_dataset("iris")

iris['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [10]:
#Why did I leave out the last one? Note True converts to 1, False converts to 0.
iris['setosa'] = iris['species'] == 'setosa'
iris['versicolor'] = iris['species'] == 'versicolor'


iris[['species', 'setosa','versicolor']]

Unnamed: 0,species,setosa,versicolor
0,setosa,True,False
1,setosa,True,False
2,setosa,True,False
3,setosa,True,False
4,setosa,True,False
...,...,...,...
145,virginica,False,False
146,virginica,False,False
147,virginica,False,False
148,virginica,False,False


# Data Partitions

Frequently, one wants to prepare a model for deployment in the real world. However, given a data set DF, it is not clear how well a model built on it will handle real data that has not yet arrived: real-world data is always in the future, outside of the computer.

Data partitions help represent this so that we can confirm that our plans work on "new data" that we have left out of our training process.

**Training data** is the one we do our initial work on.

**Validation data** is kept separate to confirm that the model built on the training data works.

**Test data** is another (optional) subset to confirm that our training & validation steps have worked.

We then test it on **new data** from the real world.

In [11]:
train, test = train_test_split(df, test_size = 0.3) #splits data 70/30 at random. 

In [12]:
train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,standardized_long,normalized_long
630,-117.89,34.49,12.0,3449.0,598.0,1502.0,540.0,3.7043,150800.0,0.851757,0.649123
2793,-120.27,39.35,11.0,2520.0,401.0,397.0,165.0,4.665,145600.0,-0.341264,0.403509
972,-121.55,40.48,14.0,2413.0,524.0,805.0,329.0,2.7857,77400.0,-0.982889,0.271414
2302,-117.83,33.79,29.0,1454.0,236.0,724.0,262.0,4.8542,218100.0,0.881833,0.655315
960,-120.87,37.76,16.0,2022.0,413.0,1126.0,408.0,2.5655,116400.0,-0.642026,0.341589


In [13]:
test.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,standardized_long,normalized_long
411,-118.19,34.14,38.0,1826.0,300.0,793.0,297.0,5.2962,291500.0,0.701376,0.618163
252,-117.02,32.69,7.0,6055.0,1004.0,3031.0,952.0,4.436,135000.0,1.287861,0.738906
1949,-121.32,38.66,26.0,1149.0,193.0,500.0,194.0,5.078,163400.0,-0.867597,0.29515
589,-117.27,32.84,34.0,1655.0,450.0,870.0,411.0,3.2109,376000.0,1.162543,0.713106
132,-122.42,37.73,50.0,3426.0,769.0,2261.0,671.0,2.888,246400.0,-1.418993,0.181631
