# Ahmad M. Osman - Dr. Lee, DS320

## Assignment Instructions

Follow the Pandas Cookbook textbook Chapter 2 but use this dataset for your data on housing or one of your choosing. Comment on the dataset and what you find in it. You must write the comments in your Jupyter Notebook to describe what you are doing at each step. Upload your notebook here when it is complete. Work through pages 52 - 87 and make sure you try examples with this new dataset. 

You can get some information about the dataset (what is called meta-data) here.

You must include at least two interesting things you found in the data as you worked through the information from chapter 2 of the Pandas Cookbook.

# Interesting Things I Found After Finishing the Assignment

* Lot area is only weakly correlated with sale price (about 0.264), whereas garage area shows a moderate correlation of about 0.623.
* Sale prices average about 181K and top out at 755K, which makes me think that these houses are located in a suburban town or area.

# Chapter 2: Essential DataFrame Operations

## Recipes
* [Selecting multiple DataFrame columns](#Selecting-multiple-DataFrame-columns)
* [Selecting columns with methods](#Selecting-columns-with-methods)
* [Ordering column names sensibly](#Ordering-column-names-sensibly)
* [Operating on the entire DataFrame](#Operating-on-the-entire-DataFrame)
* [Chaining DataFrame methods together](#Chaining-DataFrame-methods-together)
* [Working with operators on a DataFrame](#Working-with-operators-on-a-DataFrame)
* [Comparing missing values](#Comparing-missing-values)
* [Transposing the direction of a DataFrame operation](#Transposing-the-direction-of-a-DataFrame-operation)
* [MISC](#MISC)

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

# Selecting multiple DataFrame columns

In [3]:
housing = pd.read_csv('train.csv')

In [4]:
# specifying which columns to show - focusing on the lot
lotCols = ["LotArea", "LotShape", "LotConfig", "LandContour", "Neighborhood"]
lotHousing = housing[lotCols]

lotHousing.head()

Unnamed: 0,LotArea,LotShape,LotConfig,LandContour,Neighborhood
0,8450,Reg,Inside,Lvl,CollgCr
1,9600,Reg,FR2,Lvl,Veenker
2,11250,IR1,Inside,Lvl,CollgCr
3,9550,IR1,Corner,Lvl,Crawfor
4,14260,IR1,FR2,Lvl,NoRidge


# Selecting columns with methods

In [7]:
# how many data types in our data frame
housing.get_dtype_counts()

float64     3
int64      35
object     43
dtype: int64
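
`get_dtype_counts` was deprecated in pandas 0.25 and removed in 1.0, so the cell above only runs on older releases. A minimal sketch of an equivalent count for newer versions follows; the small `toy` frame is a made-up stand-in, not the housing data.

```python
import pandas as pd

# Hypothetical stand-in for the housing frame: one int, one float, one text column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "LotShape": ["Reg", "Reg", "IR1"],
})

# Count how many columns there are of each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

On the real data the same idea would be `housing.dtypes.value_counts()`.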

In [9]:
# lets get all columns with the int data type
housing.select_dtypes(include=['int']).head()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,8450,7,5,2003,2003,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2,548,0,61,0,0,0,0,0,2,2008,208500
1,2,20,9600,6,8,1976,1976,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,2,460,298,0,0,0,0,0,0,5,2007,181500
2,3,60,11250,7,5,2001,2002,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2,608,0,42,0,0,0,0,0,9,2008,223500
3,4,70,9550,7,5,1915,1970,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,3,642,0,35,272,0,0,0,0,2,2006,140000
4,5,60,14260,8,5,2000,2000,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,3,836,192,84,0,0,0,0,0,12,2008,250000


In [10]:
# all the numerical columns
housing.select_dtypes(include=['number']).head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,250000


In [11]:
# anything that has the word "Lot"
housing.filter(like='Lot').head()

Unnamed: 0,LotFrontage,LotArea,LotShape,LotConfig
0,65.0,8450,Reg,Inside
1,80.0,9600,Reg,FR2
2,68.0,11250,IR1,Inside
3,60.0,9550,IR1,Corner
4,84.0,14260,IR1,FR2


In [12]:
# filtering based on regex: any column whose name contains a digit is selected
housing.filter(regex=r'\d').head()

Unnamed: 0,Condition1,Condition2,Exterior1st,Exterior2nd,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,1stFlrSF,2ndFlrSF,3SsnPorch
0,Norm,Norm,VinylSd,VinylSd,GLQ,706,Unf,0,856,854,0
1,Feedr,Norm,MetalSd,MetalSd,ALQ,978,Unf,0,1262,0,0
2,Norm,Norm,VinylSd,VinylSd,GLQ,486,Unf,0,920,866,0
3,Norm,Norm,Wd Sdng,Wd Shng,ALQ,216,Unf,0,961,756,0
4,Norm,Norm,VinylSd,VinylSd,GLQ,655,Unf,0,1145,1053,0
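
The three ways `filter` selects columns by name can be compared side by side. This is a small sketch on a made-up frame (`toy` and the `NoSuchColumn` label are hypothetical); it only illustrates the `like`, `regex` and `items` parameters.

```python
import pandas as pd

# Hypothetical miniature of the housing columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotShape": ["Reg", "IR1"],
    "1stFlrSF": [856, 1262],
    "SalePrice": [208500, 181500],
})

# like: keep column names containing the substring
like_cols = toy.filter(like="Lot").columns.tolist()

# regex: keep column names matching the pattern (here, any digit)
regex_cols = toy.filter(regex=r"\d").columns.tolist()

# items: keep exact names; names that do not exist are skipped
item_cols = toy.filter(items=["SalePrice", "NoSuchColumn"]).columns.tolist()

print(like_cols, regex_cols, item_cols)
```

Only one of the three parameters can be passed per call.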


# Ordering column names sensibly

In [13]:
housing.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
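       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')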

In [14]:
# grouping columns together into categories

quality = ["OverallQual", "OverallCond", "BsmtCond", "GarageCond", "GarageQual"]
sale = ["MoSold", "YrSold", "SaleType", "SaleCondition", "SalePrice"]
garage = ["GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea"]
bsmt = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF"]

In [21]:
# let's arrange the column order
new_col_order = quality + sale + garage + bsmt

# the new order covers only a subset of the columns, so the two sets differ
set(housing.columns) == set(new_col_order)

False
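
The comparison only says that the two sets differ, not how. A small sketch of how one might list the columns that were left out and any name that was listed twice is below. The lists are shortened, made-up versions of the ones above; note that `BsmtCond` appears in both the quality and the basement group.

```python
import pandas as pd

# Shortened, hypothetical versions of the column groups
all_cols = ["Id", "OverallQual", "BsmtCond", "GarageCond", "SalePrice"]
quality = ["OverallQual", "BsmtCond", "GarageCond"]
sale = ["SalePrice"]
bsmt = ["BsmtCond"]
new_col_order = quality + sale + bsmt

# Columns of the frame that the new order does not mention
left_out = sorted(set(all_cols) - set(new_col_order))

# Names that occur more than once in the new order
order_index = pd.Index(new_col_order)
repeated = order_index[order_index.duplicated()].tolist()

print(left_out, repeated)
```

A repeated name means the column is selected twice when the list is used for indexing.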

In [23]:
# looks way nicer now!
housing2 = housing[new_col_order]
housing2.head()

Unnamed: 0,OverallQual,OverallCond,BsmtCond,GarageCond,GarageQual,MoSold,YrSold,SaleType,SaleCondition,SalePrice,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,BsmtQual,BsmtCond.1,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF
0,7,5,TA,TA,TA,2,2008,WD,Normal,208500,Attchd,2003.0,RFn,2,548,Gd,TA,No,GLQ,706,Unf,0,150,856
1,6,8,TA,TA,TA,5,2007,WD,Normal,181500,Attchd,1976.0,RFn,2,460,Gd,TA,Gd,ALQ,978,Unf,0,284,1262
2,7,5,TA,TA,TA,9,2008,WD,Normal,223500,Attchd,2001.0,RFn,2,608,Gd,TA,Mn,GLQ,486,Unf,0,434,920
3,7,5,Gd,TA,TA,2,2006,WD,Abnorml,140000,Detchd,1998.0,Unf,3,642,TA,Gd,No,ALQ,216,Unf,0,540,756
4,8,5,TA,TA,TA,12,2008,WD,Normal,250000,Attchd,2000.0,RFn,3,836,Gd,TA,Av,GLQ,655,Unf,0,490,1145


# Operating on the entire DataFrame

In [24]:
# looking at the df dimensions
pd.options.display.max_rows = 8
housing.shape

(1460, 81)

In [25]:
housing.size

118260

In [26]:
housing.ndim

2

In [27]:
len(housing)

1460

In [28]:
housing.count()

Id               1460
MSSubClass       1460
MSZoning         1460
LotFrontage      1201
                 ... 
YrSold           1460
SaleType         1460
SaleCondition    1460
SalePrice        1460
Length: 81, dtype: int64

In [30]:
# Looking at min values for columns
housing.min()

Id                     1
MSSubClass            20
MSZoning         C (all)
LotFrontage           21
                  ...   
YrSold              2006
SaleType             COD
SaleCondition    Abnorml
SalePrice          34900
Length: 65, dtype: object

In [31]:
# statistical descriptions
housing.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [32]:
pd.options.display.max_rows = 10

In [34]:
# more statistical descriptions
housing.describe(percentiles=[.01, .3, .99])

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
1%,15.59,20.0,21.0,1680.0,3.0,3.0,1899.18,1950.0,0.0,0.0,0.0,0.0,0.0,520.0,0.0,0.0,692.18,0.0,0.0,1.0,0.0,1.0,1.0,3.0,0.0,1916.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,61815.97
30%,438.7,20.0,60.0,8063.7,5.0,5.0,1958.0,1971.0,0.0,0.0,0.0,280.0,840.0,915.7,0.0,0.0,1208.0,0.0,0.0,1.0,0.0,3.0,1.0,6.0,0.0,1965.0,1.0,384.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,135500.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
99%,1445.41,190.0,141.0,37567.64,10.0,9.0,2009.0,2009.0,791.92,1572.41,830.38,1797.05,2155.05,2219.46,1418.92,360.0,3123.48,2.0,1.0,3.0,1.0,5.0,2.0,11.0,2.0,2009.0,3.0,1002.79,505.46,285.82,261.05,168.0,268.05,0.0,700.0,12.0,2010.0,442567.01
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [35]:
pd.options.display.max_rows = 8

In [37]:
# where do we have null entries?! And how many of them?
housing.isnull().sum()

# a lot in the LotFrontage I guess...

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
                ... 
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
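
With 81 columns the truncated display hides most of the counts. One way to see only the columns that have gaps, sorted from worst to best, is sketched below on a made-up frame (`toy` and its values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two of its three columns
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Number of missing values per column
missing = toy.isnull().sum()

# Share of missing values per column (mean of the True/False mask)
missing_share = toy.isnull().mean()

# Keep only columns with at least one gap, largest count first
worst_first = missing[missing > 0].sort_values(ascending=False)
print(worst_first)
```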

## There's more...

In [39]:
# do we have NaN values?!
housing.min(skipna=False)

Id                     1
MSSubClass            20
MSZoning         C (all)
LotFrontage          NaN
                  ...   
YrSold              2006
SaleType             COD
SaleCondition    Abnorml
SalePrice          34900
Length: 65, dtype: object

# Chaining DataFrame methods together

In [41]:
# figuring out nulls continued
housing.isnull().head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,...,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False,False


In [42]:
housing.isnull().sum().head()

Id               0
MSSubClass       0
MSZoning         0
LotFrontage    259
LotArea          0
dtype: int64

In [43]:
housing.isnull().sum().sum()

6965

In [45]:
# do we have any? Oh yeah!
housing.isnull().any().any()

True

## How it works...

In [46]:
# null values figuring continued
housing.isnull().get_dtype_counts()

bool    81
dtype: int64

## There's more...

In [47]:
# fill missing strings with "" so that min() can compare the object columns
housing.select_dtypes(["object"]).fillna("").min()

MSZoning         C (all)
Street              Grvl
Alley                   
LotShape             IR1
                  ...   
Fence                   
MiscFeature             
SaleType             COD
SaleCondition    Abnorml
Length: 43, dtype: object

In [48]:
housing.select_dtypes(["object"]).fillna("").max()

MSZoning              RM
Street              Pave
Alley               Pave
LotShape             Reg
                  ...   
Fence               MnWw
MiscFeature         TenC
SaleType              WD
SaleCondition    Partial
Length: 43, dtype: object

# Working with operators on a DataFrame

## Getting ready...

In [49]:
# lets play with the numbers
garage_year = housing["GarageYrBlt"]
garage_year

0       2003.0
1       1976.0
2       2001.0
3       1998.0
         ...  
1456    1978.0
1457    1941.0
1458    1950.0
1459    1965.0
Name: GarageYrBlt, Length: 1460, dtype: float64

In [52]:
(garage_year.head() + .00501) // .01

0    200300.0
1    197600.0
2    200100.0
3    199800.0
4    200000.0
Name: GarageYrBlt, dtype: float64

In [57]:
# rounding to two decimals with floor division
# garage years are whole numbers, so the result looks unchanged here;
# the effect would only show on values with more decimal places
garage_year_round = (garage_year + .00501) // .01 / 100
garage_year_round.head()

0    2003.0
1    1976.0
2    2001.0
3    1998.0
4    2000.0
Name: GarageYrBlt, dtype: float64

In [54]:
garage_year_round = (garage_year + .00001).round(2)
garage_year_round.head()

0    2003.0
1    1976.0
2    2001.0
3    1998.0
4    2000.0
Name: GarageYrBlt, dtype: float64
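
Because garage years have no decimals, the two rounding approaches are easier to compare on fractional values. The sketch below uses three made-up fractions (`values` is hypothetical) and checks that the operator version and the `round` method agree.

```python
import numpy as np
import pandas as pd

# Hypothetical fractional values with four decimal places
values = pd.Series([0.0333, 0.5922, 0.2990])

# Operator version: shift, floor-divide into hundredths, scale back
manual = (values + 0.00501) // 0.01 / 100

# Method version: nudge upwards slightly, then round to two decimals
builtin = (values + 0.00001).round(2)

# Compare with a tolerance rather than exact equality
agree = np.allclose(manual, builtin)
print(manual.tolist(), builtin.tolist(), agree)
```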

In [55]:
.045 + .005

0.049999999999999996
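
The result above is not exactly 0.05 because these decimals have no exact binary representation. A short sketch of why exact equality is fragile after float arithmetic, and how a tolerance-based check with `np.isclose` behaves instead:

```python
import numpy as np

total = 0.045 + 0.005

# Exact comparison is affected by the tiny representation error
exact_match = total == 0.05

# Tolerance-based comparison treats the two values as equal
close_match = bool(np.isclose(total, 0.05))

print(exact_match, close_match)
```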

In [56]:
garage_year_round.equals(garage_year)

True

# Comparing missing values

In [58]:
# looking at NaN 
np.nan == np.nan

False

In [59]:
None == None

True

In [60]:
5 > np.nan

False

In [61]:
np.nan > 5

False

In [62]:
5 != np.nan

True

In [66]:
# comparisons
garage_year.head() == 1990

0    False
1    False
2    False
3    False
4    False
Name: GarageYrBlt, dtype: bool

In [67]:
garage_year.isnull().sum()

81

In [68]:
garage_year_round == garage_year

0       True
1       True
2       True
3       True
        ... 
1456    True
1457    True
1458    True
1459    True
Name: GarageYrBlt, Length: 1460, dtype: bool
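
The rows shown above all have a garage year. Since `np.nan == np.nan` is False, the 81 rows with a missing year compare as False under `==`, while `equals` treats missing values in the same position as equal. A minimal sketch on a made-up Series (`years` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical garage years with one missing entry
years = pd.Series([2003.0, np.nan, 1976.0], name="GarageYrBlt")
same_years = years.copy()

# Element-wise comparison: the NaN position comes out False
elementwise = years == same_years
n_false = int((~elementwise).sum())

# Whole-Series comparison: NaNs in matching positions count as equal
whole = years.equals(same_years)

print(elementwise.tolist(), n_false, whole)
```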

# Transposing the direction of a DataFrame operation

In [72]:
# counting the non-missing values across the columns of each row
# this is not very informative for our data...
housing.count(axis="columns").head()

0    76
1    77
2    77
3    77
4    77
dtype: int64

In [74]:
# summing them
housing.sum(axis="columns").head()

0    231062.0
1    204978.0
2    249155.0
3    163499.0
4    280612.0
dtype: float64
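
The `axis` argument decides the direction of the operation: `"index"` collapses the rows and gives one value per column, `"columns"` collapses the columns and gives one value per row. On recent pandas releases a row-wise sum over a frame that mixes text and numbers may raise a `TypeError`, so this sketch selects the numeric columns first. The `toy` frame is a made-up stand-in.

```python
import pandas as pd

# Hypothetical miniature of the housing frame
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotShape": ["Reg", "IR1"],
    "SalePrice": [208500, 181500],
})

# Keep only numeric columns before summing
numeric = toy.select_dtypes(include="number")

# One total per column (rows are collapsed)
col_totals = numeric.sum(axis="index")

# One total per row (columns are collapsed)
row_totals = numeric.sum(axis="columns")

print(col_totals.to_dict(), row_totals.tolist())
```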

In [75]:
sale_price = housing["SalePrice"]

In [77]:
# is the sale price greater than or equal to 100K?
sale_price.ge(100000)

0       True
1       True
2       True
3       True
        ... 
1456    True
1457    True
1458    True
1459    True
Name: SalePrice, Length: 1460, dtype: bool

In [79]:
# they're getting expensive, you know...
# how many houses sold for 450K or more?
sale_price.ge(450000).sum()

14

## There's more

In [84]:
sale_price.ge(250000).sum()

225
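
Summing a boolean Series counts the True values, and taking its mean gives the share of True values. A small sketch using the first five sale prices from the data; the 200K threshold is chosen only for illustration.

```python
import pandas as pd

# First five sale prices from the housing data
sale_price = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# .ge is the method form of >= (use .gt for a strict comparison)
at_least_200k = sale_price.ge(200000)

count = int(at_least_200k.sum())    # number of houses at or above the threshold
share = float(at_least_200k.mean()) # fraction of houses at or above the threshold

print(count, share)
```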

In [87]:
# now let's see the ten largest row maxima, which here are sale prices
housing.max(axis=1).sort_values(ascending=False).head(10)

691     755000.0
1182    745000.0
1169    625000.0
898     611657.0
          ...   
440     555000.0
769     538000.0
178     501837.0
798     485000.0
Length: 10, dtype: float64

# MISC

In [72]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [90]:
# statistical description
sale_price.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

In [94]:
# looking at correlations between Sale Price and other variables
housing["SalePrice"].corr(housing["LotArea"])

0.26384335387140573

In [95]:
# this seems like a huge factor! 
housing["SalePrice"].corr(housing["GarageArea"])

0.62343143891836172

In [96]:
housing["SalePrice"].corr(housing["LotFrontage"])

0.35179909657067809

In [97]:
housing["SalePrice"].corr(housing["YrSold"])

-0.028922585168730339

In [98]:
housing["SalePrice"].corr(housing["MoSold"])

0.046432245223819391
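
Rather than checking one column at a time, the correlation of every numeric column with the sale price can be computed in one step and sorted. The sketch below uses a tiny made-up frame (all values in `toy` are invented for illustration, not taken from the housing data).

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the technique
toy = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400],
    "GarageArea": [300, 450, 500, 700],
    "MoSold": [6, 2, 9, 5],
    "Neighborhood": ["A", "B", "A", "C"],
})

corr_with_price = (
    toy.select_dtypes(include="number")  # correlations need numeric columns
       .corr()["SalePrice"]              # take the SalePrice column of the matrix
       .drop("SalePrice")                # drop the trivial self-correlation
       .sort_values(ascending=False)     # strongest positive correlation first
)
print(corr_with_price)
```

On the real data the same chain would start from `housing` instead of `toy`.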
