# Load Pandas

In [None]:
import pandas as pd

# Load Data Set

- we'll load the housing price data to continue experimenting with Pandas

- [here](https://www.kaggle.com/c/home-data-for-ml-course/data) is the source of the data 
  - find the "Download All" button to download the entire data set



In [None]:
# read the csv from drive (google drive in this case)
data = pd.read_csv('/content/drive/My Drive/Datasets/home-data-for-ml-course/train.csv')
# add your own path above to read the train.csv file


In [None]:
# display the previously loaded DataFrame
data

- in this notebook, you'll learn how to investigate data types within a DataFrame or Series. 

- you'll also learn how to find and replace entries

# Dtypes

- the data type for a column in a DataFrame or a Series is known as the dtype

- you can use the dtype property to grab the type of a specific column

In [None]:
# grab the data type of the SalePrice column
data.SalePrice.dtype

- a DataFrame or Series index has its own `dtype` too:

In [None]:
# check the data type of the index
data.index.dtype

- the `dtypes` property returns the dtype of every column in the DataFrame:

In [None]:
# get the data type of every column in the data DataFrame
data.dtypes

- Data types tell us something about how pandas is storing the data internally
  - `float64` means that it's using a 64-bit floating point number 
  - `int64` means a similarly sized integer instead, and so on.

- one peculiarity to keep in mind is that columns consisting entirely of strings do not get their own type
  - they are instead given the `object` type.
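
- as a minimal sketch, the three dtype lookups above can be tried on a small made-up frame
  - the `toy` DataFrame and its values are assumptions for illustration, not part of the housing data

```python
import pandas as pd

# hypothetical toy data, not the housing data set
toy = pd.DataFrame({
    "rooms": [3, 4, 2],                       # whole numbers
    "price": [250000.0, 310500.5, 199999.9],  # decimals
})

print(toy.dtypes)       # dtype of every column
print(toy.rooms.dtype)  # dtype of a single column
print(toy.index.dtype)  # dtype of the default index
```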

### Type Conversion

- It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the `astype()` method.

In [None]:
# convert the SalePrice column to float64 from int64
data.SalePrice.astype('float')
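
- note that `astype()` returns a new object and leaves the original untouched
  - the sketch below uses a made-up `prices` Series (an assumption, not the real column) to show this

```python
import pandas as pd

# hypothetical values standing in for the SalePrice column
prices = pd.Series([208500, 181500, 223500], name="SalePrice")

as_float = prices.astype("float64")  # integers become floats
as_text = prices.astype(str)         # integers become strings

print(prices.dtype)     # the original Series keeps its integer dtype
print(as_float.dtype)
print(as_text.iloc[0])
```

- to keep the converted values, assign the result back, e.g. `prices = prices.astype("float64")`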

# Missing Values

- entries with missing values are given the value `NaN`
  - short for "Not a Number"
  - these `NaN` values are always of the `float64` dtype

- Pandas provides some methods specific to missing data 
  - to select `NaN` entries you can use `pd.isnull()` 
  - or its companion `pd.notnull()` to select non-null values


In [None]:
# select all the entries which have null values for Fence
data[pd.isnull(data.Fence)]
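
- a small sketch of both selectors side by side, on a made-up frame
  - the `toy` DataFrame and its `Fence` labels are assumptions for illustration only

```python
import numpy as np
import pandas as pd

# hypothetical toy data with two missing Fence entries
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Fence": ["wood", np.nan, "iron", np.nan],
})

missing = toy[pd.isnull(toy.Fence)]    # rows where Fence is NaN
present = toy[pd.notnull(toy.Fence)]   # rows where Fence has a value
n_missing = toy.Fence.isnull().sum()   # count of missing entries

print(missing)
print(present)
print(n_missing)
```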


### Replacing Missing Values

- replacing missing values is a common operation
- Pandas provides a really handy method for this problem: `fillna()` 
  - `fillna()` provides a few different strategies for mitigating missing values

In [None]:
# fill missing Fence values with 'No Fence'
data.Fence.fillna('No Fence')
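
- as a rough sketch of a few fill strategies, assuming a made-up numeric Series `s`
  - a constant, the column mean, and a forward fill (`ffill()`) are shown; which one fits depends on the data

```python
import numpy as np
import pandas as pd

# hypothetical Series with two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan])

filled_const = s.fillna(0)         # replace NaN with a constant
filled_mean = s.fillna(s.mean())   # replace NaN with the mean of the known values
filled_ffill = s.ffill()           # carry the last known value forward

print(filled_const.tolist())
print(filled_mean.tolist())
print(filled_ffill.tolist())
```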

### Replacing Non-Null Data

- we may have a non-null value that we would like to replace

- the `replace()` method is handy for replacing missing data that is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on

- for example, let's say we want to replace the NoRidge neighborhood with NorthRidge

In [None]:
data.Neighborhood.replace("NoRidge","NorthRidge")
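
- the sentinel case can be sketched the same way
  - the `status` Series and its labels are made up for illustration; they are not columns of the housing data

```python
import numpy as np
import pandas as pd

# hypothetical column where missing data was recorded as text labels
status = pd.Series(["Paved", "Unknown", "Gravel", "Undisclosed"])

# turn several sentinel labels into real missing values in one call
cleaned = status.replace(["Unknown", "Undisclosed"], np.nan)

# or map individual labels to new labels with a dictionary
relabeled = status.replace({"Unknown": "Missing"})

print(cleaned.isnull().sum())
print(relabeled.tolist())
```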

# Renaming 

- oftentimes, data will come to us with column names, index names, or other naming conventions that we are not satisfied with

- in that case, there are pandas functions to change the names of the offending entries to something better

### Column Rename

In [None]:
# change the Neighborhood column name to Locality
data.rename(columns={'Neighborhood':'Locality'})

### Row Rename

- `rename()` lets you rename index or column values by specifying an `index` or `columns` keyword parameter respectively

- it supports a variety of input formats, but usually a Python dictionary is the most convenient 

- here is an example using it to rename some elements of the index

In [None]:
data.rename(index={0:"firstEntry",1:"secondEntry"})
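
- a short sketch of the input formats, on a made-up `toy` frame (an assumption, not the housing data)
  - besides a dictionary, `rename()` also accepts a function that is applied to every label
  - `rename()` returns a new DataFrame, so the original column names stay as they were

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({"Neighborhood": ["A", "B"], "SalePrice": [1, 2]})

renamed = toy.rename(columns={"Neighborhood": "Locality"})           # dictionary mapping
lowered = toy.rename(columns=str.lower)                              # function applied to each name
relabeled = toy.rename(index={0: "firstEntry", 1: "secondEntry"})    # rename row labels

print(list(renamed.columns))
print(list(lowered.columns))
print(list(relabeled.index))
print(list(toy.columns))  # unchanged
```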

##### `.set_index()`

- you'll most likely rename columns very often, but rename index values very rarely
  - for renaming index values, `set_index()` is usually more convenient

In [None]:
# set the Id column as row labels
data.set_index('Id')
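
- once a column is the index, rows can be looked up by its values with `.loc`
  - the sketch below uses a made-up `toy` frame; `reset_index()` (not covered above) moves the index back into a regular column

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({"Id": [101, 102, 103], "SalePrice": [10, 20, 30]})

indexed = toy.set_index("Id")            # Id values become the row labels
price = indexed.loc[102, "SalePrice"]    # look up a row by its Id
restored = indexed.reset_index()         # move Id back into a column

print(indexed)
print(price)
print(list(restored.columns))
```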

### `.rename_axis()`

- both the row index and the column index can have their own name attribute 
 
- the complementary `rename_axis()` method may be used to change these names 

In [None]:
# set the label for rows as Houses and columns as Details
data.rename_axis("Houses",axis="rows").rename_axis("Details", axis="columns")
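
- the names set this way end up in the `name` attribute of each axis
  - a minimal check on a made-up `toy` frame (an assumption for illustration):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({"SalePrice": [10, 20], "LotArea": [5, 6]})

labeled = toy.rename_axis("Houses", axis="rows").rename_axis("Details", axis="columns")

print(labeled.index.name)    # name of the row axis
print(labeled.columns.name)  # name of the column axis
print(toy.index.name)        # the original frame is not modified
```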

# Combining 

- when performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways 

- pandas has three core methods for doing this; in order of increasing complexity, they are
  - `concat()`, 
  - `join()`, and 
  - `merge()`
  
- most of what `merge()` can do can also be done more simply with `join()`

### `concat()`

- the simplest combining method is `concat()` 

- given a list of elements, this function will smush those elements together along an axis

- this is useful when we have data in different DataFrame or Series objects that share the same fields (columns).

##### Combining Series

In [None]:
s1 = pd.Series(['a', 'b']) # define series 1 
s2 = pd.Series(['c', 'd']) # define series 2
pd.concat([s1, s2]) # combine series 1 and 2

In [None]:
# clear the existing index and reset it in the result
pd.concat([s1, s2], ignore_index=True)

##### Combining DataFrames



In [None]:
# define DataFrame 1 
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])

# define DataFrame 2
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

# combine the two along column names, also reset indexes  
pd.concat([df1, df2], ignore_index=True)

In [None]:
# define 3rd DataFrame 
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])

# combine everything along overlapping columns  
pd.concat([df1,df3],sort=False,ignore_index=True)
# unknown values are filled with NaNs

In [None]:
# Combine DataFrames with overlapping columns and return only those that are shared
pd.concat([df1, df3], join="inner",ignore_index=True)
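
- the examples above stack rows; as a sketch of the other axis, `axis=1` places objects side by side instead
  - the `letters` and `numbers` frames are made up for illustration

```python
import pandas as pd

# two hypothetical frames with the same row labels but different columns
letters = pd.DataFrame({"letter": ["a", "b"]})
numbers = pd.DataFrame({"number": [1, 2]})

# axis=1 aligns on the row index and puts the columns next to each other
side_by_side = pd.concat([letters, numbers], axis=1)

print(side_by_side)
```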

### `.join()`

- `join()` combines the columns of another DataFrame with the calling DataFrame

- the join is done either on the index or on a key column

- efficiently join multiple DataFrame objects by index at once by passing a list

In [None]:
# init one DataFrame
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

df

In [None]:
# init another DataFrame

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

other

In [None]:
# join `df` with `other` using appropriate suffixes on `key` column of each DataFrame
df.join(other, lsuffix='_caller', rsuffix='_other')

In [None]:
# to join along the key column, i.e. key as the index of the joined DataFrame:
df.set_index('key').join(other.set_index('key'))
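
- by default `join()` keeps every row of the calling DataFrame and fills unmatched rows with `NaN`
  - as a sketch on made-up frames (`caller` and `lookup` are hypothetical names), `how="inner"` keeps only matching keys
  - the `on` parameter joins a column of the caller against the index of the other frame

```python
import pandas as pd

# hypothetical frames: lookup only covers two of the four keys
caller = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                       "A": ["A0", "A1", "A2", "A3"]})
lookup = pd.DataFrame({"key": ["K0", "K1"],
                       "B": ["B0", "B1"]})

left_join = caller.set_index("key").join(lookup.set_index("key"))                # default: how="left"
inner_join = caller.set_index("key").join(lookup.set_index("key"), how="inner")  # matching keys only
on_join = caller.join(lookup.set_index("key"), on="key")                         # keep caller's own index

print(left_join)
print(inner_join)
print(on_join)
```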

### `.merge()`

- merge DataFrame or named Series objects with a database-style join.

- the join is done on columns or indexes 
  - if joining columns on columns, the DataFrame indexes will be ignored
  - otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on

In [None]:
# define one DataFrame 
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})

df1

In [None]:
# define another DataFrame 
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

df2

In [None]:
# merge df1 and df2 on the lkey and rkey columns 
# value columns have the default suffixes, _x and _y, appended
df1.merge(df2, left_on='lkey', right_on='rkey')
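
- when both frames use the same key column name, `on` can replace `left_on` / `right_on`
  - the sketch below uses made-up frames (`left_df` and `right_df` are hypothetical names) and also shows `how` and `suffixes`

```python
import pandas as pd

# hypothetical frames that share two of their keys
left_df = pd.DataFrame({"key": ["foo", "bar", "baz"], "value": [1, 2, 3]})
right_df = pd.DataFrame({"key": ["foo", "bar", "qux"], "value": [5, 6, 7]})

# default inner merge: only keys present in both frames survive
inner = left_df.merge(right_df, on="key")

# outer merge keeps every key from both sides; custom suffixes replace _x and _y
outer = left_df.merge(right_df, on="key", how="outer", suffixes=("_left", "_right"))

print(inner)
print(outer)
```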

# Saving Data Files

- Series and DataFrames can be written to files
  - the most commonly used file format for this is `.csv`

In [None]:
# create and write DataFrame to CSV file

df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})
# check dataframe 
df

In [None]:
# write the DataFrame to a file named `output.csv`
df.to_csv('output.csv', index=False)
# check the working directory for output.csv
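
- one way to check what was written is to read it back with `pd.read_csv()`
  - the sketch below writes to an in-memory buffer instead of a file so it leaves nothing on disk; the `turtles` name is an assumption for illustration

```python
import io

import pandas as pd

turtles = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                        'mask': ['red', 'purple'],
                        'weapon': ['sai', 'bo staff']})

# write CSV text into an in-memory buffer; index=False leaves out the row labels
buffer = io.StringIO()
turtles.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# read the same text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```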