# Starting the exploratory data analysis

So far we've seen 'pure' Python. Now we'll work with tools that make data analysis easier.

![pandas](https://static.boredpanda.com/blog/wp-content/uploads/2014/05/funny-animals-doing-yoga-20.jpg)

## PANDAS 
pandas is one of the most popular Python libraries for importing and analyzing data.

---------

## open source <3

In [None]:
import pandas

### 1 - Let's start by reading a CSV (comma-separated values) file with pandas

In [None]:
titanic = pandas.read_csv('/dbfs/FileStore/CDS2023/titanic.csv')

# You can also read other file formats
# pandas.read_   # press Tab to list the options

![titanic](https://images.nationalgeographic.org/image/upload/v1638882458/EducationHub/photos/titanic-sinking.jpg)

In this case, the file really is separated by commas. If it used another delimiter, we could pass it with the `sep` parameter (a comma is the default):

```
titanic = pandas.read_csv("titanic.csv", sep=",")
```
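As a minimal sketch of a non-default delimiter, the example below reads semicolon-separated text. The content of `raw` is made up for illustration, and `io.StringIO` is used only so that the example does not depend on a file on disk.

```python
import io
import pandas

# Hypothetical content that uses ';' instead of ',' as the delimiter.
raw = "name;age;fare\nAlice;29;211.34\nBob;2;151.55\n"

# io.StringIO lets read_csv treat the string as if it were a file.
semicolon_df = pandas.read_csv(io.StringIO(raw), sep=";")
print(semicolon_df)
```

Without `sep=";"`, pandas would read each line as a single column, because it would look for commas that are not there.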

### 2 - Looking at some rows of the dataframe 

In [None]:
titanic.head()

In [None]:
titanic.head(10)

In [None]:
type(titanic)

#### A DataFrame represents data in tabular form, looking like an Excel table.
![](https://www.tutorialspoint.com/python_pandas/images/structure_table.jpg)
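To make the tabular structure concrete, here is a small sketch that builds a DataFrame by hand from a dictionary of lists. The names and values are invented for illustration and are not taken from the Titanic file.

```python
import pandas

# Hypothetical mini-table: each key becomes a column, each list holds its values.
passengers = pandas.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [29.0, 2.0, 30.0],
    "survived": [1, 0, 1],
})

print(passengers)
print(type(passengers))
```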

### What does each column mean?
Our dataset is based on public data. We can find a lot of information about the Titanic, as well as a description of the data, on Kaggle.

https://www.kaggle.com/c/titanic/data

* survived: 
    - (0 = No; 1 = Yes)
* pclass: Passenger class 
     - (1 = First; 2 = Second; 3 = Third)
* name
* sex
* age
* sibsp: Number of siblings/spouses aboard the Titanic
* parch: Number of parents/children aboard the Titanic
* ticket: Ticket number
* fare: Passenger fare
* cabin: Cabin number
* embarked: Port of Embarkation
    * [C = Cherbourg (France); Q = Queenstown (Ireland); S = Southampton (England)]

### How can we return data from a single column?

In [None]:
titanic["name"].head()  # it looks like a dictionary of lists!

#### It's also possible to access the column with dot notation

In [None]:
titanic.name.head()
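Dot notation is convenient, but it does not work for every column name. The sketch below uses two hypothetical columns (not from the Titanic data) to show cases where only bracket access returns the column.

```python
import pandas

# Hypothetical columns: "count" clashes with the DataFrame.count method,
# and "ticket class" contains a space.
df = pandas.DataFrame({"count": [3, 1], "ticket class": [1, 3]})

print(type(df["count"]))    # bracket access returns the column as a Series
print(callable(df.count))   # dot access finds the method, not the column
print(df["ticket class"])   # a name with a space needs bracket access
```

For this reason, bracket access is the safer habit when column names come from a file you do not control.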

### The type of a column is called a Series

In [None]:
type(titanic.name)

_A Series is a one-dimensional array with an index. In this case, it holds the values of the `name` column, labeled by the row index._

![](https://i0.wp.com/www.emilkhatib.com/wp-content/uploads/2015/10/SeriesEN.png)
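A short sketch of the two parts of a Series, the index (labels) and the values. The labels `a`, `b`, `c` and the ages are made up for illustration.

```python
import pandas

# Hypothetical values with custom labels instead of the default 0, 1, 2.
ages = pandas.Series([29.0, 2.0, 30.0], index=["a", "b", "c"], name="age")

print(ages.index)    # the labels
print(ages.values)   # the underlying one-dimensional array of values
print(ages["b"])     # look up a value by its label
```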

### 3 - Information about the DataFrame

What is the dimension of the dataset (rows x columns)?

In [None]:
titanic.shape

When importing the CSV, the DataFrame gets an index that does not exist in the file. It works like the row numbers in Excel.

In [None]:
titanic.index

What are the columns of the DataFrame?

In [None]:
titanic.columns

What are the types of each column?

In [None]:
titanic.dtypes

What if we want just the columns of a specific type?

In [None]:
filter_numerical_type = titanic.dtypes == "int64"
filter_numerical_type  

In [None]:
# Note: this matches only int64 columns, so float columns are left out
numerical_columns = titanic.dtypes[titanic.dtypes == "int64"].index
numerical_columns

#### How can we exclude columns of a specific type?
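One possible way to answer this is the `select_dtypes` method, which can include or exclude columns by type. The sketch below uses a small hypothetical DataFrame rather than the Titanic file.

```python
import pandas

# Hypothetical data with one text column and two numeric columns.
df = pandas.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [29.0, 2.0],
    "pclass": [1, 3],
})

only_numeric = df.select_dtypes(include="number")      # int and float columns
without_numeric = df.select_dtypes(exclude="number")   # everything else

print(only_numeric.columns.tolist())
print(without_numeric.columns.tolist())
```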

What if we want just the columns _name_, _sibsp_ and _parch_?

In [None]:
titanic[["name","sibsp","parch"]].head()

How can we return a DataFrame with all columns except the last one?

In [None]:
all_columns_except_last = titanic.columns[:-1]
all_columns_except_last

In [None]:
# Show the df result


### 4 - Statistical information about the DataFrame

In [None]:
# Statistics about the numerical data 
titanic.describe()

In [None]:
# Information about just one specific column
# titanic["age"].describe()
# or
titanic.age.describe()

#### What if we want to see information about all columns at the same time? It'll be messy, but it's possible :p

In [None]:
titanic.describe(include="all")

What if we want just one value from the statistical information?

In [None]:
#Mean
titanic["age"].mean()

In [None]:
#Standard deviation
titanic.age.std()
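Two details are worth knowing here: the result of `describe()` is itself a Series, so a single statistic can be picked by its label, and missing values are skipped by default. A minimal sketch with made-up ages:

```python
import pandas

# Hypothetical ages with one missing value.
ages = pandas.Series([22.0, 38.0, None, 30.0])

summary = ages.describe()
print(summary["count"])   # missing values are not counted
print(summary["mean"])    # same value as ages.mean()
print(ages.mean())        # NaN is skipped by default
```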

### 5 - Filtering the DataFrame

Filtering rows in which `pclass == 1` (the richest passengers)

In [None]:
filter_first_class = titanic["pclass"] == 1
filter_first_class.head()
# or
# (titanic.pclass == 1).head()

In [None]:
# titanic[titanic.pclass == 1].head()
# or
titanic[filter_first_class].head()

In [None]:
filter_first_class.sum()

Looking closer: the filter is always a Series of booleans indicating whether or not each row passes the filter.

In [None]:
titanic["pclass"] == 1
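The sketch below shows what can be done with such a boolean Series, using four hypothetical rows. Because `True` counts as 1, `sum()` gives the number of matching rows and `mean()` gives the fraction.

```python
import pandas

df = pandas.DataFrame({"pclass": [1, 3, 1, 2]})
mask = df["pclass"] == 1

print(mask.tolist())             # one boolean per row
print(mask.sum())                # number of rows that match
print(mask.mean())               # fraction of rows that match
print(df[mask].index.tolist())   # only rows where the mask is True remain
```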

The filter result is another DataFrame, so we can assign it to a variable and work with it without fear of changing the original DataFrame.

In [None]:
# Use a new name so the boolean filter_first_class is not overwritten
first_class = titanic[titanic["pclass"] == 1]
type(first_class)
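Depending on the pandas version, modifying a filtered result can trigger a `SettingWithCopyWarning`. A common precaution, sketched here with hypothetical data and a hypothetical variable name, is to call `.copy()` so that the subset clearly owns its data.

```python
import pandas

df = pandas.DataFrame({"pclass": [1, 3, 1], "fare": [100.0, 8.0, 90.0]})

# .copy() makes it explicit that the subset is independent of df.
first_class_demo = df[df["pclass"] == 1].copy()
first_class_demo["fare"] = 0.0

print(first_class_demo["fare"].tolist())  # changed in the copy
print(df["fare"].tolist())                # the original values are untouched
```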

# Challenge
Create a DataFrame with the data for the second and third classes and compare its statistics with those of the `first_class` DataFrame.

![alt text](https://comvoce.britania.com.br/wp-content/uploads/2021/07/003-Dica.jpg)

In [None]:
filter_2nd3rd = 
filter_2nd3rd

In [None]:
df_second_third = 
df_second_third.head()

In [None]:
df_second_third.describe()

__Filtering according to more than one condition__

Conditions: people who survived *AND* were from the first class

In [None]:
rich = titanic["pclass"] == 1
survived = titanic["survived"] == 1
rich_and_survived = rich & survived

Conditions: people from the first class *OR* the second class 

In [None]:
titanic[(titanic["pclass"] == 1) | (titanic["pclass"] == 2)].head()
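A few variations on combining conditions, sketched on hypothetical rows: `isin` as a shorter way to write several *OR* comparisons on the same column, `~` for negation, and a reminder that each comparison needs its own parentheses.

```python
import pandas

df = pandas.DataFrame({"pclass": [1, 2, 3, 1], "survived": [1, 0, 1, 0]})

# isin replaces a chain of == comparisons joined by |
first_or_second = df[df["pclass"].isin([1, 2])]

# ~ inverts a boolean Series (logical NOT)
not_first = df[~(df["pclass"] == 1)]

# Parentheses matter: & and | bind more tightly than ==
both = df[(df["pclass"] == 1) & (df["survived"] == 1)]

print(first_or_second.index.tolist())
print(not_first.index.tolist())
print(both.index.tolist())
```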

Comparing the difference in average age before and after the filter

In [None]:
mean = titanic.age.mean()
mean_filter = titanic[rich_and_survived].age.mean()
print("""
Average age: {}
Average age of the survivors from the first class: {}""".format(
    mean,
    mean_filter)
)

### 6 - Data Visualization

__Types of graphs__
    
    - 'line' : line plot (default)
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    - 'density' : same as 'kde'
    - 'area' : area plot
    - 'pie' : pie plot
    - 'scatter' : scatter plot
    - 'hexbin' : hexbin plot

__Histogram:__ a graph that shows the frequency of the data, that is, how many values fall into each range (bin).
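To see the numbers behind a histogram without drawing it, the sketch below counts how many values fall into each range using `pandas.cut`. The ages and the three ranges are made up for illustration.

```python
import pandas

ages = pandas.Series([2, 5, 22, 25, 29, 38, 41, 58])

# Three hypothetical age ranges; intervals are closed on the right by default.
bins = [0, 20, 40, 60]
counts = pandas.cut(ages, bins=bins).value_counts().sort_index()

print(counts)  # one count per range, i.e. the height of each bar
```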

Let's use the histogram to compare the `age` of those who survived and those who didn't.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
titanic['age'].hist()

In [None]:
survivors = titanic[titanic.survived == 1]
survivors['age'].hist()
plt.xlim((0, 80))

In [None]:
not_survivors = titanic[titanic.survived == 0]
not_survivors['age'].hist()
plt.xlim((0, 80))

### To plot both together, the secret is to put them in the same cell.

In [None]:
survivors['age'].hist(bins=20, alpha=0.5, label='Survived')
not_survivors['age'].hist(bins=20, alpha=0.5, label='Not survived')
plt.legend()

__Scatter plot__: shows the relationship between two sets of data. In this case, we are looking at the relationship between the fare paid and age.

In [None]:
titanic.plot.scatter(x = "age", y = "fare")

### 7 - Data transformation

Suppose we want to create a column whose value is the sum of `sibsp` and `parch`.

First, we need to define a function.

In [None]:
def sum_sibsp_parch(row):
    return row["sibsp"] + row["parch"]

Then we need to apply the function to every row of the DataFrame. **It is like an automatic `for` loop!**

In [None]:
new_column = titanic.apply(sum_sibsp_parch, axis=1)

`axis=1` means that the function is applied row by row.

`axis=0`, which is the default value, means that the function is applied column by column.

Let's see what it returns:

In [None]:
new_column.head()

In [None]:
type(new_column)
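The effect of `axis` can be sketched on a small hypothetical DataFrame. The last line shows an alternative that is not used in this notebook: for a simple sum, adding the two columns directly gives the same result without `apply`.

```python
import pandas

df = pandas.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})

# axis=1: the function receives one row at a time
by_row = df.apply(lambda row: row["sibsp"] + row["parch"], axis=1)

# axis=0: the function receives one column at a time
by_column = df.apply(lambda column: column.max(), axis=0)

# Alternative: add the columns directly, element by element
vectorized = df["sibsp"] + df["parch"]

print(by_row.tolist())
print(by_column.to_dict())
print(vectorized.tolist())
```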

### How can we add this new column to our DataFrame?
It's similar to adding a key to a Python dictionary.

In [None]:
titanic["relatives"] = new_column

Let's verify that the column was created!

In [None]:
titanic.columns

In [None]:
titanic.head()

### And how can we save this new DataFrame?

In [None]:
# Replace the placeholder in angle brackets (including the brackets) with your initials
titanic.to_csv("titanic_exercise_<writeYourInitialsHere>.csv", index=False)
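To see what `index=False` changes, the sketch below writes a small hypothetical DataFrame to CSV text in memory. When no path is given, `to_csv` returns the text as a string, which makes the difference easy to inspect.

```python
import io
import pandas

df = pandas.DataFrame({"name": ["Alice", "Bob"], "relatives": [1, 0]})

with_index = df.to_csv()                 # default: the index is written as a first, unnamed column
without_index = df.to_csv(index=False)   # only the real columns are written

print(with_index)
print(without_index)

# Reading back the version without the index gives the same columns again.
round_trip = pandas.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())
```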