# Load Pandas

In [32]:
import pandas as pd

# Load Data Set

- we'll load the housing price data to continue experimenting with Pandas

- [here](https://www.kaggle.com/c/home-data-for-ml-course/data) is the source of the data 
  - find the "Download All" button to download the entire data set



In [33]:
# read train.csv from a local path
# replace the path below with the location of your own copy of train.csv
data = pd.read_csv('/Users/musubimanagement/Documents/GitHub/data-science-course-wiki/common/housing-data-set/train.csv')


In [34]:
# display the previously loaded DataFrame
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


# Discussion

- plucking the right data out of our data representation is critical to getting work done

- however, the data does not always come out of memory in the format we want right off the bat

- sometimes we have to do some more work ourselves to reformat it for the task at hand

- this notebook covers some of those aspects

# Summary Functions

- pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way



### `.info()` function

- the `info()` method prints a concise summary of a DataFrame
- the summary includes the index dtype, the column dtypes, the non-null count of each column, and the memory usage

In [35]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### `.describe()` function 

- generates a high-level summary of the attributes of the given column

- it is type-aware, meaning that its output changes based on the data type of the input

In [None]:
# describe for a numerical column
data.LotArea.describe()

In [None]:
# describe for a string column
data.Neighborhood.describe()
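
- as a minimal sketch of this type-aware behaviour, the cell below calls `describe()` on a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not rows read from `train.csv`)

```python
import pandas as pd

# hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
})

# numerical column: count, mean, std, min, quartiles and max
numeric_summary = toy.LotArea.describe()
print(numeric_summary)

# string column: count, number of unique values, most frequent value and its frequency
text_summary = toy.Neighborhood.describe()
print(text_summary)
```

- the two results have different index labels, which is what "type-aware" means here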

- if you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen

- for example, to see the mean of the Lot Areas, we can use the `mean()` function

### `.mean()` function

- find the mean of a column

In [None]:
# find the mean of lot area column
data.LotArea.mean()

- to see a list of unique values we can use the `unique()` function

### `.unique()` function

- find the unique values in a column

In [None]:
# return array of unique values of neighborhood
data.Neighborhood.unique()

### `.value_counts()` function

- to see a list of unique values and how often they occur in the dataset, we can use the `value_counts()` method

In [None]:
# get the count of each unique entry of neighborhood column of data DataFrame
data.Neighborhood.value_counts()
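
- to compare `mean()`, `unique()` and `value_counts()` side by side, here is a small sketch on made-up Series (the names `lot_areas` and `neighborhoods` and their values are assumptions for illustration, not the real data set)

```python
import pandas as pd

# hypothetical toy Series, invented for illustration only
lot_areas = pd.Series([8450, 9600, 11250, 9550, 14260])
neighborhoods = pd.Series(["CollgCr", "Veenker", "CollgCr", "Crawfor", "CollgCr"])

# a single number: the arithmetic mean of the values
mean_area = lot_areas.mean()

# an array of the distinct values, in order of first appearance
unique_names = neighborhoods.unique()

# a Series indexed by the distinct values, sorted by how often each occurs
counts = neighborhoods.value_counts()

print(mean_area)
print(unique_names)
print(counts)
```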

# Maps

- **map** is a term borrowed from mathematics 
  - for a function that takes one set of values and "maps" them to another set of values
  
- in data science, we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later
  - maps are what handle this work, making them extremely important for getting your work done!

- there are two mapping methods that you will use often:
  - `.map()`
  - `.apply()`

- note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively 
  - they don't modify the original data they're called on


### `.map()` function

- `.map()` is the slightly simpler of the two mapping methods



- suppose we wanted to *re-mean* the `OverallQual` score to 0, that is, shift every score so that the column's mean becomes 0
  - we can do this as follows

In [None]:
# compute the mean of the column
data_overallqual_mean = data.OverallQual.mean()

# map column to set the score with respect to the mean
data.OverallQual.map(lambda curr_val: curr_val - data_overallqual_mean)

- the function you pass to `map()` should expect a single value from the Series (an overall quality score, in above example: curr_val), and return a transformed version of that value

- `map()` returns a new Series where all the values have been transformed by the function
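
- the following sketch repeats the re-meaning on a small made-up Series (the name `quality` and its values are assumptions for illustration) to show that `map()` leaves the original untouched

```python
import pandas as pd

# hypothetical toy scores, invented for illustration only
quality = pd.Series([7, 6, 7, 8, 5], name="OverallQual")
quality_mean = quality.mean()

# the lambda receives one value at a time and returns its transformed version
centered = quality.map(lambda curr_val: curr_val - quality_mean)

print(centered)  # the new, re-meaned Series
print(quality)   # the original Series still holds the raw scores
```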

### `.apply()` function

- `apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row

In [36]:
# compute the mean of the column once, outside the function
data_overallqual_mean = data.OverallQual.mean()

# define a function to call on each row
def remean_overallqual(row):

    # subtract the column mean from this row's score
    row.OverallQual = row.OverallQual - data_overallqual_mean

    # return the modified row
    return row

# use the function to generate a new DataFrame
data.apply(remean_overallqual, axis='columns')

# check the column 'OverallQual' to see the re-meaned (mean-centered) values

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


- if we had called `data.apply()` with `axis='index'`, then instead of passing a function to transform each row, we would need to give a function to transform each *column*.
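
- to make the difference between the two `axis` values concrete, here is a minimal sketch on a made-up two-column DataFrame (the name `toy` and its values are assumptions for illustration)

```python
import pandas as pd

# hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"OverallQual": [7, 6, 8], "OverallCond": [5, 8, 5]})

# axis='columns': the function receives one row at a time
row_sums = toy.apply(lambda row: row.OverallQual + row.OverallCond, axis="columns")

# axis='index': the function receives one column at a time
centered = toy.apply(lambda col: col - col.mean(), axis="index")

print(row_sums)  # one value per row
print(centered)  # same shape as toy, each column shifted by its own mean
```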

# Quick Maps

- pandas provides many common mapping operations as built-ins

- here's a faster way of re-meaning our `OverallQual` column:

In [None]:
# compute the mean of the column
data_overallqual_mean = data.OverallQual.mean()

# subtract the mean from the entire column
data.OverallQual - data_overallqual_mean
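
- as a quick check on made-up data (the name `quality` and its values are assumptions for illustration), the operator form and the `map()` form should give the same values

```python
import pandas as pd

# hypothetical toy scores, invented for illustration only
quality = pd.Series([7, 6, 7, 8, 5], name="OverallQual")
quality_mean = quality.mean()

# element-by-element version using map()
via_map = quality.map(lambda curr_val: curr_val - quality_mean)

# built-in version: the scalar is subtracted from every element at once
via_operator = quality - quality_mean

# compare the two results element by element
print((via_map == via_operator).all())
```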

# Quick Combine

- pandas will also understand what to do if we perform these operations between Series of equal length, pairing up the values that share the same index label
 
- for example, an easy way of combining BldgType and HouseStyle information in the dataset would be to do the following:

In [None]:
# create new series using quick combine method
data.BldgType + " - " + data.HouseStyle
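
- the same idea on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not rows read from `train.csv`), for both string and numeric Series

```python
import pandas as pd

# hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "BldgType": ["1Fam", "1Fam", "TwnhsE"],
    "HouseStyle": ["2Story", "1Story", "1Story"],
})

# string Series: + concatenates the values row by row
combined = toy.BldgType + " - " + toy.HouseStyle
print(combined)

# numeric Series: + adds the values row by row
total = pd.Series([1, 2, 3]) + pd.Series([10, 20, 30])
print(total)
```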