# Data analysis from zero to hero

## Imports section

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import os

## First step: load your data set

To analyze a data set you first have to load it correctly.

Depending on the format of the data you are going to analyze, pandas offers different functions to load it into a DataFrame.
For instance:
- CSV: pd.read_csv("file/path/to/data.csv", sep=";")
- Excel: pd.read_excel("file/path/to/data.xlsx")
- JSON: pd.read_json("file/path/to/data.json")
- HTML: pd.read_html("file/path/to/data.html") (returns a list of DataFrames, one per table found in the page)
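
As a minimal sketch of the CSV case, the snippet below reads a semicolon-separated table. The in-memory text and the `city`/`price` columns are made up for illustration and stand in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content using ';' as the separator (stands in for a file on disk)
csv_text = "city;price\nRome;250000\nMilan;320000\n"

# read_csv accepts a path or any file-like object; sep must match the file's delimiter
cities = pd.read_csv(io.StringIO(csv_text), sep=";")
print(cities)
```

If `sep` does not match the delimiter actually used in the file, pandas will typically load every row into a single column, so it is worth checking `cities.shape` right after loading.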

In [None]:
path = os.path.join("sample_data", "housing.csv")
df = pd.read_csv(path)

## Exploratory Data Analysis

First of all, look around your data in order to understand the basic features of the data frame.

The recommended steps are the following:
- Visualize the first 5 and the last 5 rows using df.head() and df.tail() respectively
- Have a look at the shape of the data set, using df.shape
- Visualize the available columns using df.columns
- Check data types using df.dtypes

For a brief look at the statistics and null values:
- df.describe() and df.describe().T
- df.info()
- df.isna().sum().sum() for the overall count of null values; use a single .sum() to get the count per column
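
The difference between the two null counts can be sketched on a tiny frame. The `toy` DataFrame and its column names are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame({
    "rooms": [3, np.nan, 5],
    "price": [100.0, 200.0, np.nan],
    "city": ["A", None, "C"],
})

# One .sum(): number of nulls in each column (a Series indexed by column name)
nulls_per_column = toy.isna().sum()

# Two .sum(): a single number for the whole frame
total_nulls = int(toy.isna().sum().sum())

print(nulls_per_column)
print(total_nulls)
```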

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.describe().T

In [None]:
df.info()

## Manage null values

Columns often contain null values.

Depending on their importance in your analysis, you can either drop the affected rows or fill the nulls with a suitable value (e.g. the column mean).

Note: this type of operation should be done on a copy of the original dataset; better safe than sorry.

Note 2: using inplace=True on some operations now raises a FutureWarning, so it is better to avoid it in favour of assignment. Assignment is also clearer, more readable, and makes the intent explicit.

In [None]:
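# Drop the rows where the column is null (assigning series.dropna() back to the column would leave the NaNs in place)
# df = df.dropna(subset=['column_with_null_values'])
# Or fill the nulls with the column mean
# df['column_with_null_values'] = df['column_with_null_values'].fillna(df['column_with_null_values'].mean())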
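
A small runnable sketch of both options, working on a copy and using assignment instead of `inplace=True`. The `raw` frame and the `bedrooms` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
raw = pd.DataFrame({"bedrooms": [2.0, np.nan, 4.0], "price": [100, 150, 200]})

# Work on a copy so the original stays untouched
cleaned = raw.copy()

# Option 1: drop the rows where 'bedrooms' is null
dropped = cleaned.dropna(subset=["bedrooms"])

# Option 2: fill the nulls with the column mean (mean() skips NaN by default)
filled = cleaned.assign(
    bedrooms=cleaned["bedrooms"].fillna(cleaned["bedrooms"].mean())
)

print(dropped)
print(filled)
```

Which option fits better depends on the analysis: dropping loses rows, while filling with the mean keeps them but flattens the column's variance.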

## Filtering data

You may want to visualize or process only a slice of a pandas Series (column) or DataFrame.
This can be done with boolean masks: their syntax is simple, and indexing with a mask (e.g. df[boolean_mask]) keeps only the rows where the mask is True.

In [None]:
# boolean_mask = series_name > value
# boolean_mask_and = (series_name > value) & (series_name < value)
# boolean_mask_or = (series_name < value) | (series_name == value)
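
The masks above can be tried on a small Series. The `prices` values are made up, and the thresholds are arbitrary.

```python
import pandas as pd

# Hypothetical price column
prices = pd.Series([90, 120, 150, 200, 310], name="price")

# Each comparison returns a boolean Series of the same length
between = (prices > 100) & (prices < 250)        # element-wise AND
low_or_exact = (prices < 100) | (prices == 310)  # element-wise OR
not_between = ~between                           # element-wise NOT

# Indexing with a mask keeps only the rows where it is True
print(prices[between].tolist())
print(prices[low_or_exact].tolist())
```

The parentheses around each comparison are required, because `&` and `|` bind more tightly than `>` and `<` in Python. The keywords `and` and `or` do not work element-wise on a Series.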

Another useful feature is column renaming, which helps clarify what each series contains. More trivially, you may want to capitalize or otherwise format the names.

In [None]:
# df.columns = df.columns.map(lambda col: col.capitalize())
# df.columns = df.columns.map(str.capitalize)
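
As an alternative sketch, `DataFrame.rename` can rename a single column through a mapping or reformat all of them through a function. The column names below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with snake_case column names
frame = pd.DataFrame({"median_income": [1.5], "total_rooms": [10]})

# Rename one column with an explicit mapping; the others are left alone
renamed = frame.rename(columns={"median_income": "income"})

# Reformat every column name with a function
pretty = frame.rename(columns=lambda col: col.replace("_", " ").title())

print(list(renamed.columns))
print(list(pretty.columns))
```

Both calls return a new DataFrame, so `frame` keeps its original names unless the result is assigned back.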

## Plotting

Visualize the data you are working with in order to extract hidden features

In [None]:
df.hist(figsize=(10, 10), bins=50)

### Simple plot

In [None]:
x = list(range(0, 21))
y = [i**2 for i in x]

plt.figure(figsize=(10,6))
plt.plot(x, y)
plt.title('Square of X')
plt.xlabel('X')
plt.ylabel('Y')
plt.xticks(x)
plt.show()

### Scatter

c sets the color of the points in the scatter chart; passing a sequence of numbers (here x) maps each value to a color of the default colormap

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=x)

### Bar Chart

In [None]:
labels = ['MS', 'Apple', 'Meta', 'Google', 'Amazon']
stock_prices = [130, 112, 145, 180, 201]

plt.figure(figsize=(10, 4))
bars = plt.bar(labels, stock_prices)
bars[2].set_hatch('.')
bars[0].set_width(0.2)
bars[1].set_color('red')

### Heat map

In [None]:
plt.figure(figsize=(10, 10))
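# numeric_only=True skips text columns, which would otherwise make corr() fail on recent pandas versions
sb.heatmap(df.corr(numeric_only=True), annot=True)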

## Data Wrangling

This step deals with outliers, i.e. values that lie far from the rest of the distribution, in order to obtain a smoother data set.

A common procedure is to replace the outliers with the mean value.
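
A minimal sketch of this idea follows. The text does not say how outliers are detected, so the 1.5 × IQR rule used here is an assumption, as is the choice to compute the replacement mean on the non-outlier values only. The `values` Series is made up.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Assumed detection rule: anything beyond 1.5 * IQR from the quartiles is an outlier
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Mean of the non-outlier values, so the outlier does not distort its own replacement
inlier_mean = values[~is_outlier].mean()

# mask() replaces the entries where the condition is True
smoothed = values.mask(is_outlier, inlier_mean)

print(smoothed.tolist())
```

Other detection rules (for example a z-score threshold) and other replacements (the median, or clipping to the bounds) are equally reasonable; the right choice depends on the data.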

## Aggregations and grouping

As in an SQL GROUP BY statement, you may need to group data by a column in order to analyze it by category, for example to calculate summary statistics such as the mean, sum, or count for each group

In [None]:
# grouped_df = df.groupby('COL_NAME').agg({'COL_1': fun1, 'COL_2': fun2})
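
The template above can be sketched on a small frame. The `sales` data and its `region`, `units`, and `price` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 5, 20, 15],
    "price": [2.0, 3.0, 4.0, 5.0],
})

# One aggregation per column: total units and average price for each region
summary = sales.groupby("region").agg({"units": "sum", "price": "mean"})

print(summary)
```

The grouping column becomes the index of the result; calling `summary.reset_index()` turns it back into a regular column if that is more convenient.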