## What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

In [None]:
import pandas as pd

In [None]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar) 

In [None]:
# we can easily read data from Excel or CSV files
df = pd.read_csv('titanic.csv')

In [None]:
# To see what the DataFrame looks like, we can use head()
df.head()  # by default it returns the first 5 rows

In [None]:
# we can also see the last records
df.tail()

In [None]:
# info() summarizes the columns, their dtypes, and the non-null counts
df.info()
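
Besides `head()`, `tail()`, and `info()`, a few other calls are handy for a first look at a DataFrame. The sketch below uses a small made-up DataFrame (the names and ages are illustrative and are not taken from `titanic.csv`) so that it runs without any file.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data, not the Titanic file)
sample = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve', 'Fay'],
    'age': [22, 35, None, 41, 29, None],
})

first_two = sample.head(2)                 # head() accepts the number of rows to return
last_three = sample.tail(3)                # so does tail()
n_rows, n_cols = sample.shape              # (rows, columns) as a tuple
missing_per_column = sample.isna().sum()   # number of missing values in each column

print(first_two)
print(last_three)
print(n_rows, n_cols)
print(missing_per_column)
```

Counting missing values per column is a quick way to decide which columns need cleaning before going further.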

## Cleaning the data

Data cleaning means fixing bad data in your data set.

Bad data could be:

- Empty cells
- Data in wrong format
- Wrong data
- Duplicates
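
Each kind of bad data listed above can be counted with a line or two of pandas. The following is a minimal sketch on a made-up workout log. The column names, the expected date format `%Y/%m/%d`, and the threshold of 120 minutes are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical workout log containing one example of each problem
log = pd.DataFrame({
    'Date': ['2020/12/01', '2020/12/02', '20201203',
             '2020/12/04', '2020/12/05', '2020/12/05'],
    'Duration': [60, 45, 60, 450, 60, 60],
    'Pulse': [110.0, None, 120.0, 130.0, 125.0, 125.0],
})

# Empty cells: count every missing value in the table
empty_cells = int(log.isna().sum().sum())

# Wrong format: dates that do not match the expected pattern become NaT
parsed = pd.to_datetime(log['Date'], format='%Y/%m/%d', errors='coerce')
bad_dates = int(parsed.isna().sum())

# Wrong data: values outside a plausible range (assumed limit: 120 minutes)
suspicious = int((log['Duration'] > 120).sum())

# Duplicates: rows identical to an earlier row
duplicates = int(log.duplicated().sum())

print(empty_cells, bad_dates, suspicious, duplicates)
```

What counts as "wrong data" depends on the data set, so the range check usually has to come from domain knowledge rather than from pandas itself.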

### Empty cells

In [None]:
# dropna() returns a new DataFrame without the rows that contain NA values;
# the original df stays unchanged, so we store the result in a new variable.

new_df = df.dropna()
new_df.info()

In [None]:
# or pass inplace=True to modify df itself instead of creating a copy
df.dropna(inplace = True)
df.info()
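
By default `dropna()` removes every row that has at least one missing value, which can discard a lot of data. The `subset` and `how` parameters narrow this down. A minimal sketch on a small made-up DataFrame (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data with missing values in different places
people = pd.DataFrame({
    'name': ['Ann', 'Bob', None, None],
    'age': [22, None, None, 41],
})

any_missing = people.dropna()                 # drop rows with a missing value in any column
age_required = people.dropna(subset=['age'])  # only look at the 'age' column
fully_empty = people.dropna(how='all')        # drop rows only if every value is missing

print(len(any_missing), len(age_required), len(fully_empty))
```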

In [None]:
# You can also replace NA values instead of removing them
df = pd.read_csv('titanic.csv')
df.fillna(130, inplace=True)
df.info()

In [None]:
# We can also fill missing values in specific columns only
df = pd.read_csv('titanic.csv')
df.fillna({'age': 22}, inplace=True)
df.info()

### Better options: replacing missing values with the mean, median, or mode

In [None]:
# with mean
df = pd.read_csv('titanic.csv')
mean = df['age'].mean()
df.fillna({'age': mean}, inplace=True)
df.info()

In [None]:
# with median
df = pd.read_csv('titanic.csv')
median = df['age'].median()
df.fillna({'age': median}, inplace=True)
df.info()

In [None]:
# with mode
df = pd.read_csv('titanic.csv')
mode = df['age'].mode()[0]
df.fillna({'age': mode}, inplace=True)
df.info()
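
Which of the three to use depends on the data. A single extreme value pulls the mean strongly, while the median and mode are much less affected. The sketch below shows this on a small made-up Series (the values are illustrative only).

```python
import pandas as pd

# Hypothetical ages with one outlier (90) and one missing value
ages = pd.Series([20, 22, 22, 24, 90, None])

mean_value = ages.mean()        # missing values are skipped by default
median_value = ages.median()
mode_value = ages.mode()[0]     # mode() returns a Series; take the first entry

filled_mean = ages.fillna(mean_value)
filled_median = ages.fillna(median_value)
filled_mode = ages.fillna(mode_value)

print(mean_value, median_value, mode_value)
```

Here the outlier lifts the mean well above most of the values, so the median or mode gives a more typical replacement.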

### Wrong data

In [None]:
df = pd.DataFrame({
    'Duration': [60, 60, 60, 60, 60, 30, 1929184141, 50],
    'Pulse': [110, 117, 120, 140, 112, 223, 122, 121],
    'Calories': [200, 432, 121, 421, 212, 521, 212, 424]
})
df.head(10)

In [None]:
# We can overwrite a single wrong value directly.
# loc selects by row label and column name: here the row with index 6, column 'Duration'.
df.loc[6, 'Duration'] = 45
df.head(10)
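
`loc` selects by label, and its counterpart `iloc` selects by integer position. With the default index (0, 1, 2, ...) the two look the same, so the difference is easier to see with a non-numeric index. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical scores with string labels as the index
scores = pd.DataFrame({'points': [10, 20, 30]}, index=['a', 'b', 'c'])

scores.loc['b', 'points'] = 25   # label-based: row labelled 'b', column 'points'
scores.iloc[2, 0] = 35           # position-based: third row, first column

print(scores)
```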

In [None]:
# We can also loop over the rows and cap every Duration above 120 at 120
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120

df.head(10)
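
The loop above works, but pandas can apply the same rule to the whole column at once. The sketch below compares the loop with `Series.clip`, using a shortened copy of the `Duration` column so that it runs on its own.

```python
import pandas as pd

# Shortened version of the Duration column, including the obviously wrong value
workouts = pd.DataFrame({'Duration': [60, 30, 1929184141, 50]})

# Loop version, as in the cell above
looped = workouts.copy()
for x in looped.index:
    if looped.loc[x, 'Duration'] > 120:
        looped.loc[x, 'Duration'] = 120

# Vectorized version: clip() caps every value at the given upper bound
clipped = workouts.copy()
clipped['Duration'] = clipped['Duration'].clip(upper=120)

print(clipped)
```

Both versions give the same result. The vectorized form is shorter and usually faster on large data sets.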

In [None]:
# Or we can remove the rows with wrong values entirely
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True) 

df.head(10)
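
Rows can also be removed without a loop by filtering with a boolean mask. A minimal sketch, again on a shortened copy of the `Duration` column:

```python
import pandas as pd

# Shortened version of the Duration column, including the obviously wrong value
workouts = pd.DataFrame({'Duration': [60, 30, 1929184141, 50]})

# The comparison produces one True/False per row; indexing keeps the True rows.
# Note: rows with a missing Duration would compare as False and be dropped too.
mask = workouts['Duration'] <= 120
kept = workouts[mask]

print(kept)
```

The original `workouts` DataFrame is left untouched, and `kept` holds only the rows that passed the check.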

### Duplicates

In [None]:
# Discovering the duplicates
df = pd.DataFrame({
    'Duration': [60, 60, 60, 60, 60, 30, 1929184141, 50],
    'Pulse': [110, 117, 120, 120, 120, 223, 122, 121],
    'Calories': [200, 432, 121, 121, 121, 521, 212, 424]
})

In [None]:
df.duplicated()

In [None]:
# So now we can remove the duplicates :)
df.drop_duplicates(inplace = True) 

In [None]:
df.head()
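
`drop_duplicates()` compares all columns and keeps the first occurrence by default. The `keep` and `subset` parameters change that behaviour. The sketch below reuses the same data as the cell above under a different variable name:

```python
import pandas as pd

records = pd.DataFrame({
    'Duration': [60, 60, 60, 60, 60, 30, 1929184141, 50],
    'Pulse': [110, 117, 120, 120, 120, 223, 122, 121],
    'Calories': [200, 432, 121, 121, 121, 521, 212, 424]
})

keep_first = records.drop_duplicates()                 # default: keep the first of each group
keep_last = records.drop_duplicates(keep='last')       # keep the last occurrence instead
keep_none = records.drop_duplicates(keep=False)        # drop every row that has a duplicate
by_duration = records.drop_duplicates(subset=['Duration'])  # compare the Duration column only

print(len(keep_first), len(keep_last), len(keep_none), len(by_duration))
```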

### Checking the correlation between columns

In [None]:
df = pd.read_csv('data.csv')
df.head()

In [None]:
# So the correlation :)
df.corr()
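
Correlation is only defined for numeric columns, and recent pandas versions may raise an error if `corr()` is called on a DataFrame that also contains text columns. One way around this, sketched below on made-up data, is to select the numeric columns first. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: Calories is exactly 5 * Duration, Label is a text column
data = pd.DataFrame({
    'Duration': [30, 45, 60, 90],
    'Calories': [150, 225, 300, 450],
    'Label': ['a', 'b', 'c', 'd'],
})

# Keep only numeric columns, then compute the correlation matrix
matrix = data.select_dtypes(include='number').corr()

# Correlation between two specific columns
pair = data['Duration'].corr(data['Calories'])

print(matrix)
print(pair)
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.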

## Creating simple plots
We will cover plots in more detail in the next lessons.

In [None]:
#Scatter plot
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show() 

In [None]:
# Histogram
df["Duration"].plot(kind = 'hist')

In [60]:
# Cross-tabulation: a pivot-style table that counts how often each pair of values occurs
pd.crosstab(df.Duration, df.Calories) 

Calories,50.3,50.5,77.7,86.2,92.7,100.7,105.3,110.4,124.0,124.2,...,853.0,873.4,953.2,1000.1,1034.4,1115.0,1376.0,1500.2,1729.0,1860.4
Duration
15,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
20,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
25,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30,0,0,0,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
45,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
75,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
90,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
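
`pd.crosstab` counts how many rows share each combination of values, so it is most readable when both columns have only a few distinct values. With a continuous column such as `Calories`, most cells end up as 0 or 1, as in the output above. The sketch below uses a small made-up table (the `Sport` column and all values are hypothetical) and also shows `pivot_table`, which aggregates a value column instead of counting.

```python
import pandas as pd

# Hypothetical training sessions
sessions = pd.DataFrame({
    'Sport': ['run', 'run', 'bike', 'bike', 'run'],
    'Duration': [30, 60, 30, 30, 60],
    'Calories': [300, 600, 200, 240, 620],
})

# Count how many sessions fall into each Sport/Duration combination
counts = pd.crosstab(sessions['Sport'], sessions['Duration'])

# Average Calories for each Sport/Duration combination
means = sessions.pivot_table(values='Calories', index='Sport',
                             columns='Duration', aggfunc='mean')

print(counts)
print(means)
```

Combinations that never occur show up as 0 in the crosstab and as `NaN` in the pivot table.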


## Exercise time!!!
Pick one of the data sets listed below and do some analysis on it.

In [None]:
import seaborn as sns
sns.get_dataset_names()

In [None]:
# replace 'name_of_set' with one of the names returned by get_dataset_names()
df = sns.load_dataset('name_of_set')