# **Lesson_4.1**

## In this lecture

* Fork repository, update or recreate codespace

* Pandas: data handling (introduction and essentials)
* In-class exercise

---

## Pandas Series and DataFrame*

<p align="center">
	<img src="../assets/img/pandas_series_dataframe.jpg" width="700">
</p>

#### In Data Science, Pandas handles data in Series and DataFrames.

* A Pandas Series is similar to a one-dimensional NumPy array: a sequence of values that holds data (numerical, categorical, date, boolean, etc.).
* A key difference is that each value in a Series can be accessed by an index label, which can be a number or a category name.

#### There are multiple ways to create a **Pandas Series**

You can create a Pandas Series from a Python list using `pd.Series()`. The `data` argument takes the values, and the optional `index` argument supplies a label for each value.

In [None]:
import numpy as np
import pandas as pd

In [None]:
categories = ['a', 'b', 'c']
my_list = [11, 22, 33]

In [None]:
ser = pd.Series(data = my_list, index = categories)
ser
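
A list is only one of the possible inputs. As a minimal sketch, the cell below builds the same Series from a dictionary and from a NumPy array, then reads a value back by label and by position. The variable names `ser_from_dict` and `ser_from_array` are illustrative and do not appear elsewhere in the lesson.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
ser_from_dict = pd.Series({'a': 11, 'b': 22, 'c': 33})

# From a NumPy array, with the labels passed separately
ser_from_array = pd.Series(data=np.array([11, 22, 33]), index=['a', 'b', 'c'])

# Access a value by its label, or by its position with .iloc
print(ser_from_dict['b'])      # 22
print(ser_from_array.iloc[0])  # 11
```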

#### Pandas **DataFrame**

**DataFrames** are the basis of Pandas.

* A DataFrame is a set of Series put together so that they share the same index.
* Think of a DataFrame as an Excel spreadsheet, or a data table in JavaScript.
* By convention, a Pandas DataFrame variable is named **df**.
* To create a DataFrame, use `pd.DataFrame()`. The DataFrame below is created with the following arguments:
	* `data` is a NumPy array of random numbers with 5 rows and 4 columns.
	* `index` is a list of row labels, A to E.
	* `columns` is a list of column labels, Col1 to Col4.


In [None]:
np.random.seed(42)
df = pd.DataFrame(data=np.random.randn(5,4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['Col1', 'Col2', 'Col3', 'Col4']
                  )

In [None]:
df
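
To make the idea of "a set of Series sharing the same index" concrete, here is a small sketch that assembles a DataFrame from two Series. The Series `price` and `stock` and their values are made up for illustration only.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
price = pd.Series([1.5, 2.0, 0.5], index=['A', 'B', 'C'])
stock = pd.Series([10, 0, 25], index=['A', 'B', 'C'])

# Each dictionary key becomes a column name
df_example = pd.DataFrame({'Price': price, 'Stock': stock})

# Selecting one column gives back a Series with the shared index
print(type(df_example['Price']).__name__)  # Series
print(df_example.shape)                    # (3, 2)
```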

### How to Filter, Add and Drop DataFrame Columns

In [None]:
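# Single brackets with one column name return a Series
df['Col1']

In [None]:
# A list of column names returns a DataFrame with those columns
df[['Col1', 'Col2']]

In [None]:
# .filter() also selects columns by name and returns a DataFrame
df.filter(['Col1'])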
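
The three selection styles differ in what they return. The sketch below checks the result types on a small stand-in DataFrame (`df_demo`, a hypothetical name used so that the lesson's `df` is left alone). It also shows that `.filter()` silently skips a column name that does not exist, whereas bracket selection would raise a `KeyError`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                       index=['A', 'B', 'C'],
                       columns=['Col1', 'Col2'])

single = df_demo['Col1']                   # single brackets: a Series
as_frame = df_demo[['Col1']]               # list of names: a DataFrame
filtered = df_demo.filter(items=['Col1'])  # .filter(): a DataFrame

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(type(filtered).__name__)  # DataFrame

# .filter() ignores names that are not present instead of raising an error
tolerant = df_demo.filter(items=['Col1', 'Missing'])
print(list(tolerant.columns))   # ['Col1']
```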

In [None]:
# Work on a copy so that the original df stays unchanged
df1 = df.copy()

In [None]:
df1

In [None]:
# Assigning to a new column name adds that column
df1['Col5'] = df['Col1']

In [None]:
df1

In [None]:
# axis=0 drops rows A and D; this returns a new DataFrame and leaves df1 unchanged
df1.drop(labels=['A', 'D'], axis=0)

In [None]:
df1

In [None]:
# inplace=True modifies df1 itself and returns None
df1.drop(labels=['A', 'D'], axis=0, inplace=True)

In [None]:
df1
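
The cells above drop rows (`axis=0`). Dropping a column works the same way with `axis=1`, or with the `columns=` keyword. This is a minimal sketch on a stand-in DataFrame; `df_demo` and `without_col4` are hypothetical names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=['A', 'B', 'C'],
                       columns=['Col1', 'Col2', 'Col3', 'Col4'])

# drop() returns a new DataFrame; df_demo itself is unchanged
without_col4 = df_demo.drop(labels=['Col4'], axis=1)
print(df_demo.shape)       # (3, 4)
print(without_col4.shape)  # (3, 3)

# Equivalent spelling with columns=, reassigned to keep the result
df_demo = df_demo.drop(columns=['Col4'])
print(list(df_demo.columns))  # ['Col1', 'Col2', 'Col3']
```

Reassigning the result, as in the last step, is a common alternative to `inplace=True`.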

---

### `.groupby()`

In [None]:
data = {'Product':['Bread','Bread','Milk','Milk','Milk','Butter','Butter','Butter'],
        'Person':['Anna','Anna','Brian','John','John','Carl','Sarah','Anna'],
        'Sales':[200,120,340,124,243,350,500,240],
        'Quantity':[3,5,3,8,2,7,5,4],
        'Margin':[100,20,280,50,100,67,300,200]}

df = pd.DataFrame(data)
df

In [None]:
# No index was passed, so Pandas assigns a default RangeIndex (0 to 7)
df.index

In [None]:
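# Equivalent shorthand: df.groupby(['Product'])
# On its own this returns a DataFrameGroupBy object; nothing is aggregated yet
df.groupby(by=['Product'])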
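
Calling `.groupby()` alone only displays a `DataFrameGroupBy` object. The sketch below, on a shortened and hypothetical `sales` table, shows a few ways to look inside it before aggregating.

```python
import pandas as pd

sales = pd.DataFrame({'Product': ['Bread', 'Milk', 'Milk', 'Butter'],
                      'Margin': [100, 280, 50, 67]})

grouped = sales.groupby(by='Product')

# Number of distinct groups
print(grouped.ngroups)            # 3

# The rows that belong to one group
print(grouped.get_group('Milk'))

# An aggregation turns the groups into one value per product
print(grouped['Margin'].sum())
```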

In [None]:
# Mean margin per product; the result is a Series indexed by Product
by_group = df.groupby(by=['Product'])['Margin'].mean()
by_group

---

## Index and Values

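In [None]:
# The index holds the group labels (the products)
by_group.index

In [None]:
# The values hold the aggregated numbers as a NumPy array
by_group.values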
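
After an aggregation, the group labels live in the index rather than in a column. As a sketch under the assumption of a small made-up `sales` table, `reset_index()` moves them back into an ordinary column, which is the shape most plotting libraries expect.

```python
import pandas as pd

sales = pd.DataFrame({'Product': ['Bread', 'Bread', 'Milk', 'Butter'],
                      'Margin': [100, 20, 280, 67]})

mean_margin = sales.groupby(by='Product')['Margin'].mean()

# The group labels form the index; the aggregated numbers are the values
print(mean_margin.index.tolist())   # ['Bread', 'Butter', 'Milk']
print(mean_margin.values.tolist())  # [60.0, 67.0, 280.0]

# reset_index() turns the Series into a DataFrame with a Product column
mean_margin_df = mean_margin.reset_index()
print(mean_margin_df.columns.tolist())  # ['Product', 'Margin']
```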

* This section is based on the Predictive Analytics module by the [Code Institute](https://codeinstitute.net/).

---

## In-class exercise: Pandas + Matplotlib

**Objective**:
* Import pandas
* Load a real CSV dataset
* Inspect the dataset
* Check for and drop missing values (use the Pandas `.dropna()` method)
* Group and summarise the data accordingly
* Plot a vertical bar chart for selected categories using Matplotlib
* Bonus: plot the same vertical bar chart using Plotly Express (you will need to call the `reset_index()` method first)
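
As a hint for the missing-values step, here is a minimal sketch on a tiny made-up table (`toy` is hypothetical and is not one of the exercise datasets). It counts missing entries per column and then drops the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({'species': ['a', 'b', 'b', None],
                    'length': [1.0, np.nan, 3.0, 4.0]})

# Count the missing values in each column
print(toy.isna().sum())

# dropna() removes every row that contains a missing value
clean = toy.dropna()
print(len(toy), len(clean))  # 4 2
```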

Choose ONE of the following common datasets:

[Iris dataset](https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv)

[Tips dataset](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv)

[Titanic dataset](https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv)

[Penguins dataset](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv)


---

### End of lesson routine