# Contents
- [Import Pandas](#import-pandas)
- [Pandas Version](#checking-pandas-version)
- [Pandas Series](#pandas-series)
- [Labels](#labels)
  - [Create Labels](#create-labels)
- [Key/Value Objects as Series](#keyvalue-objects-as-series)
- [DataFrames](#dataframes)
  - [What is a DataFrame?](#what-is-a-dataframe)
  - [Locate Row](#locate-row)
  - [Named Indexes](#named-indexes)
  - [Locate Named Indexes](#locate-named-indexes)
  - [Load Files Into a DataFrame](#load-files-into-a-dataframe)
- [Read CSV Files](#read-csv-files)
  - [max_rows](#max_rows)
- [Pandas Read JSON](#pandas-read-json)
  - [Read JSON](#read-json)
  - [Dictionary as JSON](#dictionary-as-json)
- [Pandas - Analyzing DataFrames](#pandas---analyzing-dataframes)
  - [Viewing the Data](#viewing-the-data)
  - [Info About the Data](#info-about-the-data)

# Import Pandas

In [1]:
import pandas as pd

---

# Checking Pandas Version

In [6]:
print(pd.__version__)

1.4.4


---

In [4]:
mydata = {
    "cars":["Toyota","Honda","Volvo"],
    "passings":[2,6,5]}
print(pd.DataFrame(mydata))

     cars  passings
0  Toyota         2
1   Honda         6
2   Volvo         5


---

# Pandas Series
- A Pandas Series is like a column in a table.
- It is a one-dimensional array holding data of any type.

In [8]:
a = [1,3,5,7,9]
pd.Series(a)

0    1
1    3
2    5
3    7
4    9
dtype: int64

# Labels
- If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second has index 1, and so on.
- This label can be used to access a specified value.

In [10]:
a = [1,3,5,7,9]
pd.Series(a)[2]

5


## Create Labels
- With the <span style="color: red">**index**</span> argument, you can name your own labels.

In [16]:
a = [1,3,5,7,9]
b = pd.Series(a, index=["A","B","C","D","E"])
b

A    1
B    3
C    5
D    7
E    9
dtype: int64

- When you have created labels, you can access an item by referring to the label.

In [19]:
b["B"]

3

---

# Key/Value Objects as Series
- You can also use a key/value object, like a dictionary, when creating a Series.

In [21]:
# Create a simple Pandas Series from a dictionary
bp = {"Morning":120 , "Evening":150 ,"Night":140}
pd.Series(bp)

Morning    120
Evening    150
Night      140
dtype: int64

**Note**: The keys of the dictionary become the labels.

- To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [22]:
# Create a Series using only the "Morning" and "Night" entries
bp = {"Morning":120 , "Evening":150 ,"Night":140}
pd.Series(bp , index=["Morning","Night"])

Morning    120
Night      140
dtype: int64

---

# DataFrames
- Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
- A Series is like a column; a DataFrame is the whole table.

## What is a DataFrame?
* A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

In [8]:
data = {
    "height":[165,168,150],
    "age":[21,19,17]}
df = pd.DataFrame(data)
df

   height  age
0     165   21
1     168   19
2     150   17


## Locate Row
- As you can see from the result above, the DataFrame is like a table with rows and columns.
- Pandas uses the loc attribute to return one or more specified rows.

In [7]:
# refer to the row index:
df.loc[0]

height    165
age        21
Name: 0, dtype: int64

**Note**: This example returns a Pandas Series.

In [10]:
# use a list of indexes
df.loc[[0,2]]   # Return row 0 and 2

   height  age
0     165   21
2     150   17


**Note**: When you pass a list of labels inside [] (double brackets), the result is a Pandas DataFrame.
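
The difference between the two forms can be checked directly. This is a small sketch that rebuilds the same height/age table so it runs on its own; the names `one_row` and `some_rows` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"height": [165, 168, 150], "age": [21, 19, 17]})

one_row = df.loc[0]         # a single label returns a Series
some_rows = df.loc[[0, 2]]  # a list of labels returns a DataFrame

print(type(one_row).__name__)
print(type(some_rows).__name__)
print(some_rows.shape)
```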

## Named Indexes
* With the index argument, you can name your own indexes.

In [17]:
mydata = {
    "cars":["Rolls royce","BMW","Tesla"],
    "passings":[1,2,3]}
a = pd.DataFrame(mydata , index=["1st choice", "2nd choice", "3rd choice"])
a

                   cars  passings
1st choice  Rolls royce         1
2nd choice          BMW         2
3rd choice        Tesla         3


## Locate Named Indexes
* Use the named index in the loc attribute to return the specified row(s).

In [18]:
a.loc["3rd choice"]

cars        Tesla
passings        3
Name: 3rd choice, dtype: object

## Load Files Into a DataFrame
* If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [22]:
df = pd.read_csv("iris.csv")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


# Read CSV Files
- A simple way to store big data sets is to use CSV files (comma-separated values).
- CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.

In [24]:
df = pd.read_csv("iris.csv")
print(df.to_string())

     sepal_length  sepal_width  petal_length  petal_width     species
0             5.1          3.5           1.4          0.2      setosa
1             4.9          3.0           1.4          0.2      setosa
2             4.7          3.2           1.3          0.2      setosa
3             4.6          3.1           1.5          0.2      setosa
4             5.0          3.6           1.4          0.2      setosa
5             5.4          3.9           1.7          0.4      setosa
6             4.6          3.4           1.4          0.3      setosa
7             5.0          3.4           1.5          0.2      setosa
8             4.4          2.9           1.4          0.2      setosa
9             4.9          3.1           1.5          0.1      setosa
10            5.4          3.7           1.5          0.2      setosa
11            4.8          3.4           1.6          0.2      setosa
12            4.8          3.0           1.4          0.1      setosa
...

**Tip**: Use to_string() to print the entire DataFrame.
- If you have a large DataFrame with many rows, Pandas will only display the first 5 rows and the last 5 rows.

- Print the DataFrame without the <span style="color:red">**to_string()**</span> method

In [26]:
df = pd.read_csv("iris.csv")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## max_rows
- The number of rows returned is defined in Pandas option settings.
- You can check your system's maximum rows with the <span style="color:red">**pd.options.display.max_rows**</span> setting.

In [28]:
# Check the maximum number of rows displayed
print(pd.options.display.max_rows)

60

- On my system the number is 60, which means that if the DataFrame contains more than 60 rows, the print(df) statement will show only the headers and the first and last 5 rows.
- You can change the maximum number of rows with the same setting:

In [29]:
# Increase the maximum number of rows to display the entire DataFrame
pd.options.display.max_rows = 9999
df = pd.read_csv("iris.csv")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
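
If the higher limit is only needed for one display, `pd.option_context` restores the previous setting afterwards. A minimal sketch, using a generated 100-row DataFrame as a stand-in for a file (the column name `n` is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})   # stand-in data: 100 rows

before = pd.get_option("display.max_rows")

# Inside the block the limit is raised to 200, so no rows are cut
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")
    full_text = repr(df)

# Outside the block the previous limit applies again
after = pd.get_option("display.max_rows")
print(before, inside, after)
```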


---

# Pandas Read JSON

## Read JSON
- Big data sets are often stored or extracted as JSON.
- JSON is plain text with the format of an object. It is well known in the world of programming, and Pandas can read it.

In [None]:
# Load the JSON file into a DataFrame:
df = pd.read_json('data.json')
print(df.to_string()) 
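
The cell above needs a `data.json` file on disk. To try `read_json` without one, the JSON text can be wrapped in `io.StringIO`. The sketch below uses a made-up two-column document; the column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical JSON text, standing in for the contents of a file
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# read_json accepts a path or a file-like object
df = pd.read_json(io.StringIO(json_text))
print(df.to_string())
```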

## Dictionary as JSON
- JSON objects have a format very similar to Python dictionaries.
- If your JSON data is not in a file but already in a Python dictionary, you can load it into a DataFrame directly:

In [30]:
# Load a Python Dictionary into a DataFrame
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


---

# Pandas - Analyzing DataFrames

## Viewing the Data
- One of the most used methods for getting a quick overview of the DataFrame is the head() method.
- The <span style="color:red">**head()**</span> method returns the headers and a specified number of rows, starting from the top.

In [6]:
# Get a quick overview by printing the first 10 rows of the DataFrame
df = pd.read_csv("iris.csv")
df.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


**Note**: if the number of rows is not specified, the <span style="color:red">**head()**</span> method will return the top 5 rows.

In [7]:
df = pd.read_csv("iris.csv")
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


- There is also a <span style="color:red">**tail()**</span> method for viewing the last rows of the DataFrame.
- The <span style="color:red">**tail()**</span> method returns the headers and a specified number of rows, starting from the bottom.

In [10]:
df.tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [11]:
df.tail(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
140,6.7,3.1,5.6,2.4,virginica
141,6.9,3.1,5.1,2.3,virginica
142,5.8,2.7,5.1,1.9,virginica
143,6.8,3.2,5.9,2.3,virginica
144,6.7,3.3,5.7,2.5,virginica
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


## Info About the Data
* The DataFrame object has a method called <span style="color:red">**info()**</span> that gives you more information about the data set.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


## Result Explained
- The result tells us there are 150 rows and 5 columns:\
RangeIndex: 150 entries, 0 to 149\
Data columns (total 5 columns)
- And the name of each column, with the data type:\
 0   sepal_length  150 non-null    float64\
 1   sepal_width   150 non-null    float64\
 2   petal_length  150 non-null    float64\
 3   petal_width   150 non-null    float64\
 4   species       150 non-null    object

## Null Values
- The <span style="color:red">**info()**</span> method also tells us how many Non-Null values are present in each column, and in our data set there are 150 Non-Null values in every column.
- Since the data set has 150 rows, this means that no column contains any empty values.
- Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data, and you will learn more about that in the next chapters.
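
One way to count the empty values in each column is `isna().sum()`. The sketch below uses a small made-up DataFrame (not the iris file) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values
df = pd.DataFrame({
    "height": [165, np.nan, 150],
    "age": [21, 19, np.nan],
    "name": ["Ann", "Bob", "Cid"],
})

missing = df.isna().sum()   # number of empty cells in each column
print(missing)
print(len(df) - missing)    # same numbers as the Non-Null Count in info()
```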

---

# Cleaning Data

## Data Cleaning
Data cleaning means fixing bad data in your data set.\
Bad data could be:
* Empty cells
* Data in wrong format
* Wrong data
* Duplicates

In this tutorial you will learn how to deal with all of them.

## Cleaning Empty Cells

### **Empty Cells**
Empty cells can potentially give you a wrong result when you analyze data.

### **Remove Rows**
One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.

In [32]:
# Return a new DataFrame with no empty cells
df = pd.read_csv("ship.csv")
new_df = df.dropna()
new_df

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


**Note**: By default, the <span style="color:red">**dropna()**</span> method returns a new DataFrame, and will not change the original.

If you want to change the original DataFrame, use the <span style="color:red">**inplace = True**</span> argument

In [31]:
df = pd.read_csv("ship.csv")
df.dropna(inplace=True)
df

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


**Note**: Now, the <span style="color:red">**dropna(inplace = True)**</span> will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.

### Replace Empty Values
* Another way of dealing with empty cells is to insert a new value instead.
* This way you do not have to delete entire rows just because of some empty cells.
* The <span style="color:red">**fillna()**</span> method allows us to replace empty cells with a value

In [23]:
# Replace NULL values with the number 1
df1 = pd.read_csv("ship.csv")
df1.fillna(1, inplace=True)

### Replace Only For Specified Columns
The example above replaces all empty cells in the whole DataFrame.

To only replace empty values for one column, specify the column name for the DataFrame.

In [22]:
df2 = pd.read_csv("ship.csv")
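# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in newer pandas
df2["age"] = df2["age"].fillna(18)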
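
When several columns need different replacement values, `fillna()` also accepts a dictionary that maps column names to values. A minimal sketch on made-up data (the values `18` and `"unknown"` are arbitrary choices for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "deck": ["C", None, "E"],
})

# Each column gets its own replacement value; the original is left unchanged
filled = df.fillna({"age": 18, "deck": "unknown"})
print(filled)
```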

### Replace Using Mean, Median, or Mode
A common way to replace empty cells is to calculate the mean, median or mode value of the column.

Pandas uses the <span style="color:red">**mean()**</span>, <span style="color:red">**median()**</span> and <span style="color:red">**mode()**</span> methods to calculate the respective values for a specified column.

In [26]:
# Calculate the MEAN, and replace any empty values with it
df3 = pd.read_csv("ship.csv")
t = df3["age"].mean()   # take the mean from the same DataFrame that is being filled
df3["age"] = df3["age"].fillna(t)
df3

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.000000,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.000000,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.000000,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.000000,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.000000,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.000000,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.000000,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,29.699118,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.000000,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


**Mean** = the average value (the sum of all values divided by number of values).

In [27]:
# Calculate the MEDIAN, and replace any empty values with it
df3 = pd.read_csv("ship.csv")
t = df3["age"].median()   # take the median from the same DataFrame that is being filled
df3["age"] = df3["age"].fillna(t)
df3

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,28.0,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


**Median** = the value in the middle, after you have sorted all values ascending.

In [28]:
# Calculate the MODE, and replace any empty values with it
df3 = pd.read_csv("ship.csv")
t = df3["age"].mode()[0]   # take the mode from the same DataFrame that is being filled
df3["age"] = df3["age"].fillna(t)
df3

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,24.0,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


**Mode** = the value that appears most frequently.
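
The three statistics can differ noticeably on the same data. A small sketch with made-up ages, where one large value pulls the mean upwards while the median and mode stay put:

```python
import pandas as pd

ages = pd.Series([20, 22, 22, 30, 56])   # made-up values

mean_age = ages.mean()        # (20 + 22 + 22 + 30 + 56) / 5 = 30.0
median_age = ages.median()    # middle value of the sorted list = 22.0
mode_age = ages.mode()[0]     # most frequent value = 22

print(mean_age, median_age, mode_age)
```

`mode()` returns a Series because several values can be tied for most frequent; `[0]` picks the first of them.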

## Cleaning Data of Wrong Format

### Data of Wrong Format
* Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
* To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.
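
As a sketch of how the two options can work together, `pd.to_numeric` with `errors="coerce"` converts a column to numbers and turns anything it cannot parse into `NaN`, which can then be removed like any other empty cell. The column name and values below are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"age": ["22", "38", "unknown", "35"]})

# Convert text to numbers; entries that cannot be parsed become NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Remove the rows whose "age" could not be converted
clean = df.dropna(subset=["age"])
print(clean)
```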

## Removing Duplicates

### Discovering Duplicates
* Duplicate rows are rows that have been registered more than one time.
* To discover duplicates, we can use the <span style="color:red">**duplicated()**</span> method.
* The <span style="color:red">**duplicated()**</span> method returns a Boolean value for each row.

In [36]:
# Returns True for every row that is a duplicate, otherwise False
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

### Removing Duplicates
- To remove duplicates, use the <span style="color:red">**drop_duplicates()**</span> method.

In [38]:
# Remove all duplicates
df.drop_duplicates(inplace = True)
df

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
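
Every visible value in the duplicated() output above is False, so it is hard to see what the two methods do. Here is a sketch on a small made-up DataFrame with one repeated row:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "age": [21, 19, 21],
})

flags = df.duplicated()          # True for a row that repeats an earlier row
unique = df.drop_duplicates()    # keeps the first occurrence of each row

print(flags.tolist())
print(unique)
```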


## Data Correlations

### Finding Relationships
* A great aspect of the Pandas module is the <span style="color:red">**corr()**</span> method.
* The <span style="color:red">**corr()**</span> method calculates the relationship between each pair of columns in your data set.

In [39]:
df.corr()

Unnamed: 0.1,Unnamed: 0,survived,pclass,age,sibsp,parch,fare,adult_male,alone
Unnamed: 0,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658,0.04101,0.057462
survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307,-0.55708,-0.203367
pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495,0.094035,0.135207
age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067,0.280328,0.19827
sibsp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651,-0.253586,-0.584471
parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225,-0.349943,-0.583398
fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0,-0.182024,-0.271832
adult_male,0.04101,-0.55708,0.094035,0.280328,-0.253586,-0.349943,-0.182024,1.0,0.404744
alone,0.057462,-0.203367,0.135207,0.19827,-0.584471,-0.583398,-0.271832,0.404744,1.0


**Note**: The <span style="color:red">**corr()**</span> method ignores "not numeric" columns. In pandas 2.0 and later this is no longer the default, and you need to pass numeric_only=True.

**Result Explained**

The result of the <span style="color:red">**corr()**</span> method is a table of numbers that represent how strong the relationship is between two columns.

The number varies from -1 to 1.

1 means that there is a 1 to 1 relationship (a perfect correlation), and for this data set, each time a value went up in the first column, the other one went up as well.

0.9 is also a good relationship, and if you increase one value, the other will probably increase as well.

-0.9 would be just as good a relationship as 0.9, but if you increase one value, the other will probably go down.

0.2 means NOT a good relationship: if one value goes up, it does not mean that the other will.
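
To see the two extremes of the scale, here is a sketch on made-up data where one column rises exactly with `x` and another falls exactly as `x` rises. The column names `double` and `negated` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["double"] = df["x"] * 2    # rises exactly with x
df["negated"] = -df["x"]      # falls exactly as x rises

corr = df.corr()
print(corr.loc["x", "double"])    # perfect positive correlation, about 1
print(corr.loc["x", "negated"])   # perfect negative correlation, about -1
```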

**What is a good correlation?**

 It depends on the use, but I think it is safe to say you have to have at least <span style="color:red">**0.6 (or -0.6)**</span> to call it a good correlation.

The examples below refer to a workout data set with "Duration", "Maxpulse" and "Calories" columns, not to the ship table above.

**Perfect Correlation:**
In that data set, "Duration" and "Duration" got the number <span style="color:red">**1.000000**</span>, which makes sense: each column always has a perfect relationship with itself.

**Good Correlation:**
"Duration" and "Calories" got a <span style="color:red">**0.922721**</span> correlation, which is a very good correlation, and we can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long workout.

**Bad Correlation:**
"Duration" and "Maxpulse" got a <span style="color:red">**0.009403**</span> correlation, which is a very bad correlation, meaning that we cannot predict the max pulse by just looking at the duration of the workout, and vice versa.