# Exploratory Data Analysis - Basics

This notebook draws on the following tutorials:
*  Alex The Analyst's YouTube [Exploratory Data Analysis in Pandas | Python Pandas Tutorials](https://youtu.be/Liv6eeb1VfE?si=tCr3Yt-PZirW3Av0)
*  Rob Mulla's YouTube [Exploratory Data Analysis with Pandas Python 2023](https://youtu.be/xi0vhXFPegw?si=oZ4_8raYZtdF9OEo)
*  Kaggle's Tutorial [Your First Machine Learning Model](https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model/tutorial)
*  ChatGPT [query it to make your own EDA tutorial](https://chat.openai.com/auth/login)

We are going to accomplish the following tasks in this notebook:
*  Import the dataset into Google Colab from the desktop
*  Import data analysis libraries
*  Display the number of rows and columns in the dataset
*  Display the first 5 rows of the dataset
*  Display the first 10 rows of the dataset
*  Display descriptive statistics for the dataset

We can drag and drop the data files (CSV files) that we want to work with from our local drive into the Google Colab file panel, which opens from the folder icon on the left side of the Colab screen:
*  Download the [Kaggle train.csv dataset](https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model/data) to your desktop
*  Click the folder icon on the left side of the Colab screen, roughly halfway down
*  Drag and drop the train.csv file into the folder to upload it to Google Colab from your desktop
*  You will need to repeat this upload every time you use the notebook, because files uploaded this way do not persist between Colab sessions
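
Because the upload has to be repeated each session, it can help to confirm the file is present before trying to load it. The following is a minimal sketch using only the Python standard library; the helper name `find_missing` is hypothetical and not part of any tutorial referenced here, and it assumes the file was dropped into the notebook's current working directory.

```python
from pathlib import Path


def find_missing(filenames, folder="."):
    """Return the names from `filenames` that are not files inside `folder`.

    Hypothetical helper for illustration only.
    """
    return [name for name in filenames if not (Path(folder) / name).is_file()]


# Check for the dataset used in this notebook.
missing = find_missing(["train.csv"])
if missing:
    print("Please upload these files before continuing:", missing)
else:
    print("All data files found.")
```

If the message lists train.csv, repeat the drag-and-drop steps above and run the cell again.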

Our first script loads the following libraries:
*  [Pandas](https://pandas.pydata.org/) version 2.1.4, dated 8 December 2023
*  [Seaborn](https://seaborn.pydata.org/) version 0.13.1, dated December 2023
*  [Matplotlib](https://matplotlib.org/) version 3.8.2, dated 17 November 2023

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
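
The version numbers listed above describe the libraries at the time of writing, so the environment you run in may have different ones. As a rough sketch, the standard-library `importlib.metadata` module can report what is installed; the helper name `installed_versions` is hypothetical and only used for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent.

    Hypothetical helper for illustration only.
    """
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


versions = installed_versions(["pandas", "seaborn", "matplotlib"])
for name, ver in versions.items():
    print(name, ver if ver is not None else "not installed")
```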

Our next script:
*  Uses the Pandas library
*  Loads the data
*  Uses the [.shape attribute](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) (an attribute, so it is written without parentheses)
*  Returns a tuple representing the dimensionality (rows, columns) of the DataFrame

In [None]:
# Load data
home_data = pd.read_csv('train.csv')

# Display the number of rows and columns comprising the dataset
home_data.shape

(1460, 81)
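
To see how loading and `.shape` fit together without needing the full file, here is a small sketch that reads a three-row CSV from an in-memory string. The names `csv_text` and `toy` are illustrative assumptions; the values are the `Id`, `LotArea`, and `SalePrice` entries of the first three rows of the dataset.

```python
import io

import pandas as pd

# A tiny stand-in for train.csv (illustrative, not the real file).
csv_text = """Id,LotArea,SalePrice
1,8450,208500
2,9600,181500
3,11250,223500
"""

# read_csv accepts a file path or any file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```

Unpacking the tuple this way is handy when only one of the two numbers is needed later, for example the row count.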

This script:
*  Uses the Pandas library
*  Uses the [.head method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
*  Displays the first n rows of the dataset
*  n = 5 is the default

In [None]:
# Display the first five rows of the dataset
home_data.head()

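|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |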
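
The `...` in the middle of the output marks columns that pandas hides when a DataFrame is too wide to display. One way to show every column is to lift the `display.max_columns` limit temporarily with `pd.option_context`. This is a minimal sketch on a made-up 30-column frame named `wide` (an illustrative assumption, not the housing data).

```python
import pandas as pd

# A one-row frame with 30 columns, wide enough to be truncated on display.
wide = pd.DataFrame([list(range(30))], columns=[f"col{i}" for i in range(30)])

# Remember the current limit so we can confirm it is restored afterwards.
default_limit = pd.get_option("display.max_columns")

# Inside the block the limit is lifted (None means "no limit");
# the previous setting comes back automatically when the block ends.
with pd.option_context("display.max_columns", None):
    limit_inside = pd.get_option("display.max_columns")
    print(wide.head())
```

Using a context manager rather than `pd.set_option` keeps the change local to one cell, so later outputs stay compact.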


Our next script:
*  Uses the Pandas library
*  Uses the [.head method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
*  Returns the first n rows
*  Accepts a different n as an argument, for example n = 10


In [None]:
# Display the first 10 rows of the dataset
home_data.head(10)

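|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| 5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
| 7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |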
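
The effect of the `n` argument is easiest to see on a small frame. The sketch below uses a six-row DataFrame named `toy` (an illustrative assumption) built from the `Id` and `SalePrice` values of the first six rows shown above. It also shows that a negative `n` returns every row except the last `|n|`.

```python
import pandas as pd

# Six rows, using Id and SalePrice values from the dataset's first six rows.
toy = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4, 5, 6],
        "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
    }
)

first_five = toy.head()          # default n = 5
first_two = toy.head(2)          # explicit n
all_but_last_two = toy.head(-2)  # negative n drops the last two rows

print(len(first_five), len(first_two), len(all_but_last_two))
```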


Our next script:
*  Uses the Pandas library
*  Uses the [.describe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
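*  Generates descriptive statistics
*  Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values
*  By default, only the numeric columns are summarized, which is why text columns such as MSZoning do not appear in the output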



In [None]:
# Display descriptive statistics about the dataset
home_data.describe()

|   | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |
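
Two details of the output above are worth trying on a small example: the `count` row skips NaN values (LotFrontage has a count of 1201 rather than 1460), and text columns are left out unless you ask for them. The sketch below uses a three-row frame named `toy` (an illustrative assumption) built from the MSZoning, LotFrontage, and SalePrice values of rows 0, 7, and 8 shown earlier; `include="all"` is a documented option of `describe`.

```python
import pandas as pd

# Rows 0, 7 and 8 of the dataset, trimmed to three columns.
# Row 7 has a missing LotFrontage value.
toy = pd.DataFrame(
    {
        "MSZoning": ["RL", "RL", "RM"],
        "LotFrontage": [65.0, None, 51.0],
        "SalePrice": [208500, 200000, 129900],
    }
)

# Default: numeric columns only; NaN is excluded from every statistic.
numeric_summary = toy.describe()

# include="all" adds non-numeric columns, with rows such as unique, top, freq.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example `mean` for MSZoning) are shown as NaN.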
