_**Important Note**: This notebook is simply a series of notes I'm writing as I code along with the lecture videos. For a much more thorough explanation of what is discussed in the course lectures, check out the [official notebook](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-pandas.ipynb) for this section._

# Data Analysis using Pandas

[pandas](https://pandas.pydata.org/) is an open-source library for data analysis and manipulation. It provides simple yet powerful functions for handling various kinds of data. `pandas` integrates with many other Python data science and machine learning tools, and is typically used to transform data into a format that machine learning algorithms can take as input.

The very first thing we need to do when starting up the project is import the `pandas` library:

In [1]:
import pandas as pd

The above syntax imports `pandas` under the alias `pd`, so that we can type `pd` instead of the full name whenever we use one of `pandas`' data analysis tools.

## Two Main Datatypes

`pandas` offers two main data types: `Series` and `DataFrame`.

### Series

To create a new `Series` object, simply call the `pd.Series` function with the list of data to be stored in the `Series`:

In [2]:
series = pd.Series(["BMW", "Toyota", "Honda"])
type(series)

pandas.core.series.Series

In [3]:
series

0       BMW
1    Toyota
2     Honda
dtype: object

In [4]:
colours = pd.Series(["Red", "Blue", "White"])
colours

0      Red
1     Blue
2    White
dtype: object

One thing to note about a `Series` is that it is a *one-dimensional* object, meaning the data can have multiple rows but only one column of values (aside from the index labels).
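
As a quick check of the one-dimensional claim, here is a minimal sketch that inspects the same `Series` through its `ndim`, `shape` and `index` attributes. The data is the list used earlier; nothing else is assumed.

```python
import pandas as pd

# Same data as the `series` object created earlier
series = pd.Series(["BMW", "Toyota", "Honda"])

# Number of dimensions: a Series only has rows
print(series.ndim)

# The shape is a one-element tuple (number of rows,), with no column count
print(series.shape)

# The index labels that pandas generated automatically
print(list(series.index))
```

The shape only has a single entry because there is no second axis to count.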

### DataFrame

Unlike a `Series`, a `DataFrame` is a *two-dimensional* object, meaning it can have multiple columns of information. `DataFrame`s are much more commonly used than `Series` because most real-world data contains multiple features, and each feature needs its own column, which is exactly what `DataFrame`s are best suited for.

To create a new `DataFrame` object, use the `pd.DataFrame` function with a dictionary that contains the information to be stored in the `DataFrame`. You can even create a `DataFrame` out of multiple `Series` objects:

In [5]:
car_data = pd.DataFrame({"Car make": series, "Colour": colours})
type(car_data)

pandas.core.frame.DataFrame

In [6]:
car_data

|  | Car make | Colour |
|---|---|---|
| 0 | BMW | Red |
| 1 | Toyota | Blue |
| 2 | Honda | White |
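
The dictionary values do not have to be `Series` objects. As a rough sketch, the same table can be built from plain Python lists, and `ndim` and `shape` confirm that the result is two-dimensional. The variable name `car_data_from_lists` is made up for this example.

```python
import pandas as pd

# Each dictionary key becomes a column name, each list becomes that column's values
car_data_from_lists = pd.DataFrame({
    "Car make": ["BMW", "Toyota", "Honda"],
    "Colour": ["Red", "Blue", "White"],
})

# Two dimensions: rows and columns
print(car_data_from_lists.ndim)

# Shape is (number of rows, number of columns)
print(car_data_from_lists.shape)

# Selecting a single column gives back a Series
print(type(car_data_from_lists["Colour"]))
```

Selecting one column returns a `Series`, which is one way to see how the two data types relate to each other.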


## Importing Data

You might start to notice that creating `Series` and `DataFrame` objects from scratch gets rather tedious, especially when handling large amounts of data. Luckily, most real-world data is already stored in separate files such as spreadsheets.

With `pandas`, you can easily use those spreadsheets and convert them into workable objects by importing the spreadsheet data. For example, the `pd.read_csv` function can import data from `.csv` spreadsheets:

In [7]:
car_sales = pd.read_csv("../data/car-sales.csv")
car_sales

|  | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Toyota | White | 150043 | 4 | $4,000.00 |
| 1 | Honda | Red | 87899 | 4 | $5,000.00 |
| 2 | Toyota | Blue | 32549 | 3 | $7,000.00 |
| 3 | BMW | Black | 11179 | 5 | $22,000.00 |
| 4 | Nissan | White | 213095 | 4 | $3,500.00 |
| 5 | Toyota | Green | 99213 | 4 | $4,500.00 |
| 6 | Honda | Blue | 45698 | 4 | $7,500.00 |
| 7 | Honda | Blue | 54738 | 4 | $7,000.00 |
| 8 | Toyota | White | 60000 | 4 | $6,250.00 |
| 9 | Nissan | White | 31600 | 4 | $9,700.00 |


As you can see, `pd.read_csv` reads the spreadsheet data and returns it as a `DataFrame`. This means we can easily work with the imported data using the various data analysis tools that `pandas` has to offer!
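
To try `pd.read_csv` without the file on disk, here is a small sketch that reads the first three rows of the table from an in-memory string through `io.StringIO`. The CSV layout (no index column in the file) and the name `car_sales_sample` are assumptions made for this example, not something taken from the course files.

```python
import io

import pandas as pd

# A few rows copied from the car sales table, assumed to match the file's layout
csv_text = '''Make,Colour,Odometer (KM),Doors,Price
Toyota,White,150043,4,"$4,000.00"
Honda,Red,87899,4,"$5,000.00"
Toyota,Blue,32549,3,"$7,000.00"
'''

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
car_sales_sample = pd.read_csv(io.StringIO(csv_text))

# The result is a regular DataFrame
print(type(car_sales_sample))

# Column types are inferred: the odometer is numeric, but Price stays as text
# because of the "$" and "," characters
print(car_sales_sample.dtypes)
```

Notice that the `Price` column is not read as a number, so it would need cleaning before any arithmetic could be done on it.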

## Anatomy of a DataFrame

![Anatomy of a DataFrame](./img/anatomy-of-a-dataframe.png)

## Exporting a DataFrame

You can also export `DataFrame`s into a separate file. The `to_csv` method exports a `DataFrame` into a `.csv` spreadsheet file. Alternatively, you can also use the `to_excel` method to export the `DataFrame` into an Excel spreadsheet instead.

In [8]:
car_sales.to_csv("exported_car_sales.csv")

And as usual, you can then import the file back into a `DataFrame`:

In [9]:
exported_car_sales = pd.read_csv("./exported_car_sales.csv")
exported_car_sales

|  | Unnamed: 0 | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|---|
| 0 | 0 | Toyota | White | 150043 | 4 | $4,000.00 |
| 1 | 1 | Honda | Red | 87899 | 4 | $5,000.00 |
| 2 | 2 | Toyota | Blue | 32549 | 3 | $7,000.00 |
| 3 | 3 | BMW | Black | 11179 | 5 | $22,000.00 |
| 4 | 4 | Nissan | White | 213095 | 4 | $3,500.00 |
| 5 | 5 | Toyota | Green | 99213 | 4 | $4,500.00 |
| 6 | 6 | Honda | Blue | 45698 | 4 | $7,500.00 |
| 7 | 7 | Honda | Blue | 54738 | 4 | $7,000.00 |
| 8 | 8 | Toyota | White | 60000 | 4 | $6,250.00 |
| 9 | 9 | Nissan | White | 31600 | 4 | $9,700.00 |


Huh, that's weird... It seems that another column was added when the object was exported! That extra column is actually the index, which `to_csv` writes into the `.csv` file by default. Since the index has no name, `pd.read_csv` labels it `Unnamed: 0` when the file is read back in. `DataFrame`s automatically create an index anyway, so we can omit it when we export to a separate file:

In [10]:
car_sales.to_csv("exported_car_sales_no_index.csv", index=False)

Verifying that the exported file does not have an extra index column:

In [11]:
exported_car_sales_no_index = pd.read_csv("./exported_car_sales_no_index.csv")
exported_car_sales_no_index

|  | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Toyota | White | 150043 | 4 | $4,000.00 |
| 1 | Honda | Red | 87899 | 4 | $5,000.00 |
| 2 | Toyota | Blue | 32549 | 3 | $7,000.00 |
| 3 | BMW | Black | 11179 | 5 | $22,000.00 |
| 4 | Nissan | White | 213095 | 4 | $3,500.00 |
| 5 | Toyota | Green | 99213 | 4 | $4,500.00 |
| 6 | Honda | Blue | 45698 | 4 | $7,500.00 |
| 7 | Honda | Blue | 54738 | 4 | $7,000.00 |
| 8 | Toyota | White | 60000 | 4 | $6,250.00 |
| 9 | Nissan | White | 31600 | 4 | $9,700.00 |
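
The same round trip can be sketched in memory to see exactly what ends up in the CSV text. This example relies on `to_csv` returning a string when no path is given, and it also shows `index_col=0` as an alternative fix on the reading side. The tiny `df` and the other variable names are made up for the illustration.

```python
import io

import pandas as pd

# A tiny stand-in for the car sales data
df = pd.DataFrame({"Make": ["Toyota", "Honda"], "Doors": [4, 4]})

# With no path argument, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The header of the first version starts with an empty field for the index
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the text back: the unnamed index field becomes an "Unnamed: 0" column
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))
print(list(reloaded_with.columns))
print(list(reloaded_without.columns))

# Alternative: keep the index in the file and tell read_csv to use the
# first column as the index instead of as data
reloaded_index_col = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(reloaded_index_col.columns))
```

Either approach avoids the extra column. `index=False` keeps the file itself clean, while `index_col=0` is handy when the file was already exported with its index.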


You can also import data from external URLs:

In [12]:
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease

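|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |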
