# Pandas

## What is Pandas?

**Pandas** is a Python library used for working with **data sets**. It has functions for **analyzing**, **cleaning**, **exploring**, and **manipulating** data.<br/>

The name "Pandas" refers to both **"Panel Data"** and **"Python Data Analysis"**. The library was created by *Wes McKinney* in 2008.

## Why use Pandas?

Pandas lets us analyze large data sets, draw conclusions based on statistical theory, and clean messy data sets (for example, by deleting rows that are irrelevant or that contain wrong values, such as empty or NULL values) so that they become readable and relevant.

## Installation of Pandas: 

If you have **Python** and **PIP** already installed on a system, then installation of Pandas is very easy.<br/>
Install it using this command:
```
C:\Users\Your Name>pip install pandas
```

### Let's Start!! 

#### Import Pandas:

In [1]:
# Pandas is usually imported under the pd alias.
import pandas as pd

In [2]:
# Checking Pandas Version:
pd.__version__

'2.3.3'

## Series:

A **Pandas Series** is like a column in a table. It is a one-dimensional labeled array that can hold any type of data (numbers, strings, dates, etc.).

In [2]:
# Create a simple Pandas Series from a list:
my_list = [4,6,7,5]
my_series = pd.Series(my_list)
my_series

0    4
1    6
2    7
3    5
dtype: int64

### Indexing: 

The values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.<br/>
This label can be used to access a specified value.

In [5]:
# Return the first value in the Series
print(my_series[0])
# or
print(my_series.iloc[0])
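# Assigning to a label that does not exist yet adds a new entry.
# With the default integer index, -1 is read as a label, not as "the last position",
# so this appends a value labeled -1 (use my_series.iloc[-1] = -1 to overwrite the last value instead).
my_series[-1] = -1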
print(my_series)

4
4
 0    4
 1    6
 2    7
 3    5
-1   -1
dtype: int64
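
The difference between labels and positions is easier to see when the index is not simply 0, 1, 2. The following is a minimal sketch with made-up values; the Series `s` and its index are assumptions chosen only so that labels and positions disagree.

```python
import pandas as pd

# Illustrative Series whose integer labels are in reverse order,
# so that label 0 and position 0 point at different values.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # value stored under the label 0
by_position = s.iloc[0]  # first value in the Series, whatever its label

print(by_label)     # 30
print(by_position)  # 10
```

Using `loc` for labels and `iloc` for positions makes the intent explicit and avoids surprises with integer indexes.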


### Create Labels: 

In [7]:
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
# When you have created labels, you can access an item by referring to the label.
print(myvar["y"])
# or
print(myvar.loc["z"])

x    1
y    7
z    2
dtype: int64
7
2


In [9]:
# You can also use a key/value object, like a dictionary, when creating a Series.
# The keys of the dictionary become the labels.
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
print("------")
print(myvar.loc[["day1", "day3"]])

day1    420
day2    380
day3    390
dtype: int64
------
day1    420
day3    390
dtype: int64


## DataFrame:

Data sets in Pandas are usually multi-dimensional tables, called **DataFrames**. A Series is like a column; a DataFrame is the whole table.<br/>
A **DataFrame** is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

In [10]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
df 

   calories  duration
0       420        50
1       380        40
2       390        45


### Indexing: 

In [18]:
# As the result above shows, a DataFrame is like a table with rows and columns.
# Use loc to return one or more rows by label.
# A single label returns a Pandas Series.
print(df.loc[0])
# A list of labels (here, rows 0 and 1) returns a Pandas DataFrame.
print("-----")
print(df.loc[[0, 1]])

calories    420
duration     50
Name: 0, dtype: int64
-----
   calories  duration
0       420        50
1       380        40


### Named Indexes and Columns:

In [28]:
import numpy as np
data = [[1, 234, 13], [2, 231, 30], [3, 300, 46]]
df = pd.DataFrame(np.array(data), columns=["day", "calories", "duration"], index = ["a", "b", "c"])
df

   day  calories  duration
a    1       234        13
b    2       231        30
c    3       300        46


> **Important note**  
> If you want to create data **without key/value pairs**, remember to place the values for each row **inside a list**.
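
As a small illustration of the note above, the same table can be built from plain Python lists, without NumPy. This is only a sketch; the variable names `rows` and `df_rows` are chosen here for the example.

```python
import pandas as pd

# Each inner list is one row; the column names are supplied separately.
rows = [[1, 234, 13], [2, 231, 30], [3, 300, 46]]
df_rows = pd.DataFrame(rows, columns=["day", "calories", "duration"])

print(df_rows.shape)  # (3, 3)
```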

In [29]:
df.loc["b"]

day           2
calories    231
duration     30
Name: b, dtype: int64

### Load files into a DataFrame: 

If your data sets are stored in a file, Pandas can load them into a DataFrame. For example, you can load a **comma separated values file (CSV file)** into a DataFrame.<br/>
**CSV files** contain plain text and are a well-known format that can be read by almost any tool, including Pandas.<br/>
In our examples we will be using a CSV file called 'data.csv'.<br/>
**[Download data.csv](https://www.w3schools.com/python/pandas/data.csv).**


In [2]:
df = pd.read_csv("data.csv")
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


- The dots in the output mean that the DataFrame has more rows than the display limit on your system, so only part of it is printed.

In [4]:
# Check the maximum number of rows that will be displayed:
pd.options.display.max_rows

60

- On my system the value is **60**, which means that if the DataFrame contains more than 60 rows, `print(df)` shows **only the headers and the first and last 5 rows**.
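
If you only need the full output once, the limit can be raised temporarily instead of changing it globally. The sketch below uses `pd.option_context` on a made-up 100-row DataFrame (`demo` is not the `data.csv` data) and assumes the default limit of 60 rows.

```python
import pandas as pd

# Illustrative DataFrame with 100 rows, more than the default limit of 60.
demo = pd.DataFrame({"value": range(100)})

# option_context changes the setting only inside the with-block;
# the previous value is restored automatically afterwards.
with pd.option_context("display.max_rows", 200):
    full_text = repr(demo)

# The text now contains every row instead of a truncated view.
print(len(full_text.splitlines()))
```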

In [None]:
# To get all of your data as one string, use to_string().
df.to_string()
# See only the first 5 rows.
df.head()
# Pass a number to choose how many rows to show, e.g. df.head(10).
# See the last 5 rows.
df.tail()
# See 3 random rows (pass any number you like).
df.sample(3)

In [8]:
# A DataFrame object has a method called info() that gives you more information about the data set.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB


**Result Explained**
- There are 169 rows and 4 columns.
- The table presents the name of each column, along with its data type.
- In the **Calories** column, only 164 of the 169 values are non-null, which means 5 rows contain NaN (empty or null values).
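
The number of missing values per column can also be counted directly. Here is a minimal sketch on made-up data (the `sample` DataFrame is an assumption, not the contents of `data.csv`):

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in "Calories".
sample = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.0],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```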

## Analyzing DataFrames: 

### Describing Data:

In [3]:
df.describe()

        Duration       Pulse    Maxpulse    Calories
count      169.0       169.0       169.0       164.0
mean   63.846154  107.461538  134.047337  375.790244
std    42.299949   14.510259   16.450434  266.379919
min         15.0        80.0       100.0        50.3
25%         45.0       100.0       124.0     250.925
50%         60.0       105.0       131.0       318.6
75%         60.0       111.0       141.0       387.6
max        300.0       159.0       184.0      1860.4


### Sorting Data: 

In [6]:
# Sort the data in ascending order:
print(df.sort_values("Duration", ascending=True).head(3))
print("-----------")
# Sort the data in descending order:
print(df.sort_values("Duration", ascending=False).head(3))

     Duration  Pulse  Maxpulse  Calories
112        15    124       139     124.2
93         15     80       100      50.5
58         20    153       172     226.4
-----------
    Duration  Pulse  Maxpulse  Calories
69       300    108       143    1500.2
79       270    100       131    1729.0
60       210    108       160    1376.0


### Add a Column:

In [17]:
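# The new column needs exactly one value per row, so the number of periods is taken from len(df).
# The output shown here has no row 2 because it was captured after that row had been dropped (168 rows left).
df["Date"] = pd.date_range("20201201", periods=len(df))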
df.head()

   Duration  Pulse  Maxpulse  Calories       Date
0        60    110       130     409.1 2020-12-01
1        60    117       145     479.0 2020-12-02
3        45    109       175     282.4 2020-12-03
4        45    117       148     406.0 2020-12-04
5        60    102       127     300.0 2020-12-05


### Deleting Rows and Columns:

In [19]:
# Use df.drop() to delete rows and df.drop(columns=...) to delete specific columns.
# Delete the row whose index label is 2.
df = df.drop(2)
df

   Duration  Pulse  Maxpulse  Calories       Date
0        60    110       130     409.1 2020-12-01
1        60    117       145     479.0 2020-12-02
3        45    109       175     282.4 2020-12-03
4        45    117       148     406.0 2020-12-04
5        60    102       127     300.0 2020-12-05


In [25]:
# Delete the 'Pulse' column with the columns= argument.
delete_column = df.drop(columns="Pulse")
print(delete_column.head(3))
print("--------")
# Or delete the 'Calories' column by passing axis=1.
delete_column = df.drop("Calories", axis=1)
print(delete_column.tail())

   Duration  Maxpulse  Calories       Date
0        60       130     409.1 2020-12-01
1        60       145     479.0 2020-12-02
3        45       175     282.4 2020-12-03
--------
     Duration  Pulse  Maxpulse       Date
164        60    105       140 2021-05-13
165        60    110       145 2021-05-14
166        60    115       145 2021-05-15
167        75    120       150 2021-05-16
168        75    125       150 2021-05-17


### Filtering Data:

In [26]:
# Keep only the rows where the exercise lasted at least 60 minutes.
df[df["Duration"] >= 60]

     Duration  Pulse  Maxpulse  Calories       Date
0          60    110       130     409.1 2020-12-01
1          60    117       145     479.0 2020-12-02
5          60    102       127     300.0 2020-12-05
6          60    110       136     374.0 2020-12-06
9          60     98       124     269.0 2020-12-09
..        ...    ...       ...       ...        ...
164        60    105       140     290.8 2021-05-13
165        60    110       145     300.0 2021-05-14
166        60    115       145     310.2 2021-05-15
167        75    120       150     320.4 2021-05-16
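
Several conditions can be combined in one filter. The sketch below uses a small made-up DataFrame (`sample` is an assumption, not `data.csv`) to keep rows that satisfy two conditions at once.

```python
import pandas as pd

# Illustrative data with the same column names as the tutorial file.
sample = pd.DataFrame({
    "Duration": [60, 45, 75, 60],
    "Pulse": [110, 109, 120, 98],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
long_and_intense = sample[(sample["Duration"] >= 60) & (sample["Pulse"] > 100)]

print(long_and_intense.index.tolist())  # [0, 2]
```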


## Cleaning Data:

**Data cleaning** means fixing bad data in your data set. Bad data could be:
- Empty cells
- Data in wrong format
- Wrong data
- Duplicates

### Empty Cells: 

#### Remove Rows: 

- Empty cells can potentially give you a wrong result when you analyze data.
- One way to deal with empty cells is to remove rows that contain empty cells.
- This is usually acceptable because data sets can be very big, so removing a few rows will not have a big impact on the result.

In [None]:
new_data = df.dropna()
# If you want to change the original DataFrame, use the 'inplace = True' argument.
df.dropna(inplace=True)
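
By default `dropna()` removes a row if any of its cells is empty. If only some columns matter, the `subset` argument restricts the check to those columns. A minimal sketch on made-up data (the `sample` DataFrame is an assumption):

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values in two different columns.
sample = pd.DataFrame({
    "Pulse": [110, np.nan, 103, 109],
    "Calories": [409.1, 479.0, np.nan, 282.4],
})

# Drop a row only when "Calories" is missing; a missing "Pulse" is kept.
cleaned = sample.dropna(subset=["Calories"])

print(cleaned.index.tolist())  # [0, 1, 3]
```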

#### Replace Empty Values: 

- Another way of dealing with empty cells is to insert a new value instead.
- This way you do not have to delete entire rows just because of some empty cells.

In [None]:
# The 'fillna()' method allows us to replace empty cells with a value.
# It returns a new DataFrame; with inplace=True it would return None instead.
new_data = df.fillna(130)
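
Calling `fillna()` on the whole DataFrame puts the same value into every column. A common alternative is to fill a single column with a value computed from that column, such as its mean. The following is a sketch on made-up data (the `sample` DataFrame is an assumption):

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in "Calories".
sample = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [400.0, np.nan, 300.0],
})

# mean() ignores missing values: (400 + 300) / 2 = 350.
calories_mean = sample["Calories"].mean()

# Fill only the "Calories" column and assign the result back.
sample["Calories"] = sample["Calories"].fillna(calories_mean)

print(sample["Calories"].tolist())  # [400.0, 350.0, 300.0]
```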

### Data of Wrong Format: 

- Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
- To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.
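
Both options can be sketched with a date column. The example below assumes a made-up `Date` column stored as text in `YYYY/MM/DD` form (the tutorial's `data.csv` has no such column); `pd.to_datetime` converts it, and `errors="coerce"` turns unparseable values into `NaT` so that they can be removed afterwards.

```python
import pandas as pd

# Illustrative data: the second date is not a valid date, the third is missing.
sample = pd.DataFrame({
    "Date": ["2020/12/01", "not a date", None],
    "Duration": [60, 45, 30],
})

# Option 1: convert every cell in the column to the same format.
# Values that cannot be parsed become NaT (the missing value for dates).
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y/%m/%d", errors="coerce")

# Option 2: remove the rows whose date could not be converted.
cleaned = sample.dropna(subset=["Date"])

print(len(cleaned))  # 1
```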

### Wrong Data: 

- **"Wrong data"** does not have to be "empty cells" or "wrong format"; it can simply be wrong, for example if someone registered "176" instead of "1.76" as a height.
- Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.
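
One way to act on such an expectation is to write it down as a rule and then either correct or drop the rows that break it. This is only a sketch: the `people` DataFrame, the `height_m` column, and the 3-metre limit are all assumptions made for the example.

```python
import pandas as pd

# Illustrative heights in metres; 176 was probably meant to be 1.76.
people = pd.DataFrame({"height_m": [1.80, 176.0, 1.65]})

# Assumed rule for this sketch: no plausible height exceeds 3 metres.
suspicious = people["height_m"] > 3

# Correct a single known-bad cell by its row label and column name ...
people.loc[1, "height_m"] = 1.76

# ... or keep only the rows that satisfy the rule.
plausible = people[people["height_m"] <= 3]

print(suspicious.tolist())  # [False, True, False]
```

For a small data set, correcting values one by one is feasible; for a large one, a rule-based filter is usually more practical.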

### Removing Duplicates: 

- **Duplicate** rows are rows that have been registered more than one time.
- To discover duplicates, we can use the `duplicated()` method, which returns a Boolean value for each row.

In [None]:
df.duplicated()

In [None]:
# To remove duplicates, use the 'drop_duplicates()' method.
df.drop_duplicates(inplace=True)
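
To see what these two methods return, here is a minimal sketch on made-up data in which the third row repeats the first (the `sample` DataFrame is an assumption):

```python
import pandas as pd

# Illustrative data: row 2 is an exact copy of row 0.
sample = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 109, 110],
})

# True for every row that repeats an earlier row.
flags = sample.duplicated()

# Without inplace=True, drop_duplicates() returns a new DataFrame.
deduplicated = sample.drop_duplicates()

print(flags.tolist())     # [False, False, True]
print(len(deduplicated))  # 2
```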