#What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.


#Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

#What Can Pandas Do?
Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
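
As a rough illustration of these questions, the sketch below builds a small made-up DataFrame (the column names and values are assumptions, not part of any data set used later) and asks for the correlation, average, maximum, and minimum.

```python
import pandas as pd

# Hypothetical workout data, invented for this illustration
workouts = pd.DataFrame({
    "duration": [30, 45, 60, 90],
    "calories": [200, 300, 400, 600],
})

# Pearson correlation between two columns
correlation = workouts["duration"].corr(workouts["calories"])

# Basic summary statistics for one column
average = workouts["calories"].mean()
highest = workouts["calories"].max()
lowest = workouts["calories"].min()

print(correlation, average, highest, lowest)
```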

#Getting started

###Installation of Pandas
```
pip install pandas
```
###Import Pandas
```
import pandas
```
```
import pandas as pd
```
###Checking Pandas Version
```
import pandas as pd

print(pd.__version__)
```




In [2]:
# Example
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


#Pandas Data Structures

Let's first get acquainted with two of pandas' primary data
structures: the **Series** and the **DataFrame**. They can handle
the majority of use cases in *finance*, *statistics*, *social science*,
and many areas of engineering.

###SERIES
A Series is a one-dimensional object similar to an array, a list,
or a column in a table. Each item in a Series is assigned to an
entry in an index:

In [None]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

In [None]:
# Create labels
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
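
Once a Series has labels, an item can be read either by its label or by its integer position. A minimal sketch, reusing the same values and labels as the cell above (the variable names `by_label` and `by_position` are only illustrative):

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar["y"]        # look up by label
by_position = myvar.iloc[1]  # look up by integer position

print(by_label, by_position)
```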

Exercises:
1. Create a Series data structure of 5 random values.
2. Add labels (i, j, x, y, z) to that Series.
3. Print out the first value (label "i").

In [5]:
#Solution
import random


data = pd.Series(random.choices(range(1,5), k=5))
print(data)

0    4
1    4
2    2
3    2
4    2
dtype: int64
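
The solution cell above covers only the first step of the exercise. One possible way to finish steps 2 and 3 is sketched below; the fixed seed is an assumption added only so that repeated runs give the same values.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed for repeatable output

# Step 1: five random values between 1 and 4
values = random.choices(range(1, 5), k=5)

# Step 2: attach the labels i, j, x, y, z
labeled = pd.Series(values, index=["i", "j", "x", "y", "z"])

# Step 3: read the first value through its label
first = labeled["i"]

print(labeled)
print(first)
```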


###Key/Value Objects as Series
Create a simple Pandas Series from a dictionary:

In [None]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

In [None]:
#Create a Series using only data from "day1" and "day2":

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

###Data Frames
- The DataFrame is a tabular data structure comprising a set
of ordered columns and rows.
- It can be thought of as a group of Series objects, one per
column, that share an index (the row labels). There are a number
of ways to initialize a DataFrame object.
- First, let's look at the common case of creating a
DataFrame from a dictionary of lists:


In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)

In [None]:
#Locate by label: a single cell, then rows 0 and 1
print(myvar.loc[0,"calories"])
print(myvar.loc[[0, 1]])

In [None]:
#Locate by integer position
print(myvar.iloc[0])
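
With the default 0, 1, 2 index, `loc` and `iloc` look the same. The difference shows up once the rows have their own labels. A small sketch, assuming hypothetical row labels `day1` to `day3` on the same data:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# Hypothetical row labels, chosen only to make the difference visible
frame = pd.DataFrame(data, index=["day1", "day2", "day3"])

row_by_label = frame.loc["day2"]             # label-based lookup
row_by_position = frame.iloc[1]              # position-based lookup
single_cell = frame.loc["day2", "calories"]  # row label, column label

print(row_by_label.equals(row_by_position), single_cell)
```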

###Read CSV Files

A simple way to store big data sets is to use CSV (comma-separated values) files.

CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.

In [None]:
#Load the CSV into a DataFrame:
df = pd.read_csv('data.csv')

print(df.to_string())

In [None]:
# max_rows and max_columns
print(pd.options.display.max_rows)
print(pd.options.display.max_columns)


In [None]:
pd.options.display.max_rows = 9999
pd.options.display.max_columns = 4
df = pd.read_csv('data.csv')

print(df)

In [None]:
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 5)

df = pd.read_csv('data.csv')

print(df)
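
Setting display options as above changes them for the rest of the session. When the change is only needed for one printout, `pd.option_context` can apply it temporarily. A sketch using a made-up wide frame (the frame and its column names are assumptions):

```python
import pandas as pd

# Made-up frame with 20 rows and 10 columns
wide = pd.DataFrame({f"col{i}": range(20) for i in range(10)})

default_rows = pd.get_option("display.max_rows")

# The options apply only inside the with-block
with pd.option_context("display.max_rows", 5, "display.max_columns", 4):
    inside_rows = pd.get_option("display.max_rows")
    print(wide)  # printed in truncated form

# After the block the previous setting is back
restored_rows = pd.get_option("display.max_rows")
```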

###Read JSON

Big data sets are often stored or extracted as JSON.

JSON is plain text with the structure of an object. It is a well-known format in the world of programming, and Pandas can read it.

In [None]:
df = pd.read_json('data.json')

print(df.to_string())
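
To see what kind of JSON layout `read_json` accepts without needing a file on disk, the sketch below parses a small in-memory string. The JSON content is invented for illustration; each top-level key becomes a column and each nested key becomes a row label.

```python
from io import StringIO

import pandas as pd

# Made-up JSON: column name -> {row label -> value}
raw = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'

# Wrapping the string in StringIO lets read_json treat it like a file
frame = pd.read_json(StringIO(raw))

print(frame)
```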

#Analyzing DataFrames
### Viewing the Data
Use head(), tail(), describe(), info(), and columns to get a first look at a DataFrame.


> The examples below use titanic.csv.



In [None]:
# Describe features
# We can use .describe() to extract some standard details about our numerical features.
df.describe()

In [None]:
#The DataFrame object has a method called info() that gives you more information about the data set.
df.info()

In [None]:
df.head(10)

In [None]:
df.tail(10)

###Filtering dataframes

In [None]:
# Selecting data by feature
df["name"].head()

In [None]:
# Filtering
df[df["sex"]=="female"].head() # only the female data appear
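
Several conditions can be combined in one filter with `&` (and) or `|` (or), as long as each condition is wrapped in parentheses. A minimal sketch on a made-up passenger table (names and values are assumptions, not rows from titanic.csv):

```python
import pandas as pd

# Invented rows that mimic a few Titanic-style columns
passengers = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "sex": ["female", "male", "female", "male"],
    "age": [29, 41, 8, 35],
})

# Keep rows where both conditions hold
adult_women = passengers[(passengers["sex"] == "female") & (passengers["age"] >= 18)]

print(adult_women)
```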

###Sorting
We can also sort our features in ascending or descending order.

In [None]:
# Sorting
df.sort_values("age", ascending=False).head()

###Grouping
We can also get statistics across our features for certain groups. Here we want to see the average of our continuous features based on whether the passenger survived or not.

In [None]:
# Grouping
survived_group = df.groupby("survived")
# numeric_only=True skips text columns such as names, which cannot be averaged
survived_group.mean(numeric_only=True)
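
To make the grouping result easier to check by hand, here is the same idea on a tiny invented table, selecting the numeric columns explicitly before taking the mean (all values are assumptions):

```python
import pandas as pd

# Invented rows with Titanic-style column names
passengers = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [40, 20, 30, 50],
    "fare": [10.0, 80.0, 60.0, 20.0],
})

# One row per group, one column per averaged feature
summary = passengers.groupby("survived")[["age", "fare"]].mean()

print(summary)
```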

####Feature engineering
We're now going to use feature engineering to create a column called family_size. We'll first define a function called get_family_size that determines the family size from the number of siblings and spouses (sibsp) and the number of parents and children (parch).

In [None]:
# Helper function used to create the new feature
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size

In [None]:
df["family_size"] = df[["sibsp", "parch"]].apply(lambda x: get_family_size(x["sibsp"], x["parch"]), axis=1)
df.head()
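
For a simple sum like this, adding the two columns directly gives the same result as the row-by-row `apply` and is usually faster on large frames. A sketch comparing both on invented values (the frame and the `via_*` column names are assumptions):

```python
import pandas as pd


def get_family_size(sibsp, parch):
    return sibsp + parch


# Invented rows with the two Titanic-style columns
passengers = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})

# Row-by-row version, as in the cell above
passengers["via_apply"] = passengers.apply(
    lambda row: get_family_size(row["sibsp"], row["parch"]), axis=1
)

# Column arithmetic version
passengers["via_columns"] = passengers["sibsp"] + passengers["parch"]

print(passengers)
```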

#Data Cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

- Empty cells
- Data in wrong format
- Wrong data
- Duplicates


###Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

In [None]:
# Remove Rows
df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())

In [None]:
#change the original DataFrame
df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())

In [None]:
#Replace NULL values with the number 130:

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)

In [None]:
#Replace NULL values in the "Calories" column with the number 130:

df = pd.read_csv('data.csv')

# Assign the result back; fillna(inplace=True) on a selected column may not update df
df["Calories"] = df["Calories"].fillna(130)

In [None]:
#Replace Using Mean, Median, or Mode
df = pd.read_csv('data.csv')

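x = df["Calories"].mean() # alternatives: df["Calories"].median(), df["Calories"].mode()[0]

# Assign the result back; fillna(inplace=True) on a selected column may not update df
df["Calories"] = df["Calories"].fillna(x)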
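
The effect of filling with the mean is easiest to see on a few values. In the sketch below the column contents are invented; `mean()` ignores the missing entries, so the fill value is the average of the known ones.

```python
import pandas as pd

# Invented column with two missing values
workouts = pd.DataFrame({"Calories": [400.0, None, 300.0, None, 200.0]})

# Missing values are skipped when computing the mean
mean_value = workouts["Calories"].mean()

# fillna returns a new Series; assign it back to keep the change
workouts["Calories"] = workouts["Calories"].fillna(mean_value)

print(workouts)
```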

In [None]:
# Dropping multiple columns
df = df.drop(["name", "cabin", "ticket"], axis=1)

In [None]:
# Rows with at least one NaN value
df[pd.isnull(df).any(axis=1)].head()
# Drop rows with NaN values
df = df.dropna() # removes rows with any NaN values
df = df.reset_index(drop=True) # resets row indexes and discards the old ones

###Data of Wrong Format
Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.

To fix this, you have two options: remove the rows, or convert all cells in the column into the same format.


In [8]:
# Convert the "Date" column to datetime (the CSV must contain a "Date" column)
df = pd.read_csv('../../lab05/data/data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

KeyError: 'Date'

In [None]:
# Remove rows with a NULL value in the "Date" column:

df.dropna(subset=['Date'], inplace = True)
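
The two steps above, converting and then dropping rows without a usable date, can be tried on a small in-memory frame. The values are invented, and `errors="coerce"` is one possible choice here: it turns entries that cannot be parsed into missing values instead of raising an error.

```python
import pandas as pd

# Invented data: two valid dates, one empty cell, one unparseable string
workouts = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-02", None, "not a date"],
    "Duration": [60, 45, 30, 450],
})

# Unparseable or empty entries become NaT (the datetime version of NaN)
workouts["Date"] = pd.to_datetime(workouts["Date"], errors="coerce")
missing_dates = workouts["Date"].isna().sum()

# Keep only the rows that have a date
cleaned = workouts.dropna(subset=["Date"])

print(cleaned)
```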

####Fixing Wrong Data
"Wrong data" does not have to be empty cells or a wrong format; it can simply be a wrong value. A value that looks off is not necessarily wrong, but since this is the data set of someone's workout sessions, we can conclude that this person did not work out for 450 minutes.

In [None]:
#Replacing Values
df.loc[7, 'Duration'] = 45

In [None]:
# If the value is higher than 120, set it to 120:

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120

In [None]:
# Removing Rows
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
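
The two loops above can also be written without iterating over the index, by using a boolean condition on the column. A sketch on invented durations (the frame and the names `capped` and `filtered` are assumptions):

```python
import pandas as pd

# Invented durations, two of them above the 120 limit
workouts = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Cap values above 120 at 120 (working on a copy to keep the original)
capped = workouts.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Or keep only the rows that are within the limit
filtered = workouts[workouts["Duration"] <= 120]

print(capped)
print(filtered)
```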

###Removing Duplicates


In [None]:
# Returns True for every row that is a duplicate, otherwise False:

print(df.duplicated())

In [None]:
# Removing Duplicates
# To remove duplicates, use the drop_duplicates() method.

df.drop_duplicates(inplace = True)
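
A short sketch of both calls on an invented frame, where the third row repeats the first. By default the first occurrence is kept and later copies are flagged.

```python
import pandas as pd

# Invented rows; the last one duplicates the first
workouts = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110],
})

flags = workouts.duplicated()             # True for repeated rows
unique_rows = workouts.drop_duplicates()  # new frame without the repeats

print(flags)
print(unique_rows)
```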

#Save data
Finally, let's save our preprocessed data into a new CSV file to use later.

In [None]:
# Saving dataframe to CSV
df.to_csv("cleaned_data.csv", index=False)