# Why Use Pandas?

Pandas lets us analyze large data sets and draw conclusions from them using statistical methods.

Pandas can also clean messy data sets and make them readable and relevant.

In [29]:
import pandas

In [30]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


Pandas is usually imported under the `pd` alias.

In [31]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [32]:
print(pd.__version__)

2.2.2


# What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [33]:
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

0    1
1    7
2    2
dtype: int64


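## Labels

If no index is specified, the values are labeled by their position, starting at 0. A label can be used to access the corresponding value.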

In [34]:
print(myvar[0])

1


## Create Labels

In [35]:
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

x    1
y    7
z    2
dtype: int64


In [36]:
print(myvar["y"])

7


## Key/Value Objects as Series

In [37]:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

day1    420
day2    380
day3    390
dtype: int64


In [38]:
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)

day1    420
day2    380
dtype: int64
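
As a small sketch of how the `index` argument behaves with a dictionary: only the listed keys are kept, and a label that is not a key in the dictionary gets `NaN`. The label `"day9"` below is made up for illustration. Because `NaN` is a floating-point value, the dtype becomes `float64`.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day9" is a made-up label that is not a key in the dictionary
s = pd.Series(calories, index=["day1", "day9"])

print(s)         # day1 keeps its value, day9 is NaN
print(s.isna())  # True marks the missing entry
```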


## DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

In [39]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


## Locate Row

In [40]:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


**Note**: It returns a Pandas Series.

In [41]:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40


When a list of labels is passed inside `[]`, the result is a Pandas DataFrame.
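
The difference between the two forms of `loc` can be checked directly. This is a minimal sketch reusing the same small data set; the variable names `row` and `rows` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row = df.loc[0]      # single label: the result is a Series
rows = df.loc[[0]]   # list with one label: the result is a one-row DataFrame

print(type(row).__name__)
print(type(rows).__name__)
```

Even with a single label in the list, the result stays a DataFrame, which is handy when later code expects a table.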

In [42]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


## Locate Named Indexes

In [43]:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
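
With named indexes, `loc` looks rows up by label. Pandas also offers `iloc`, which looks rows up by integer position regardless of the labels. The sketch below, using the same data, shows both reaching the same row.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

by_label = df.loc["day2"]   # look up by index label
by_position = df.iloc[1]    # look up by integer position (the second row)

print(by_label.equals(by_position))
```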


## Read CSV Files

In [44]:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
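
If `data.csv` is not at hand, `read_csv` can be tried on an in-memory string instead. The rows below are made-up sample data with the same column names, not the contents of the real file; the second row deliberately has an empty `Calories` cell.

```python
import io
import pandas as pd

# Made-up sample rows; the second row has an empty Calories cell
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,
45,109,175,282.4
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```

The empty cell is read as `NaN`, which is why a column with missing values ends up as `float64`.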


## Viewing the Data

In [45]:
print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


In [46]:
print(df.tail(1))

     Duration  Pulse  Maxpulse  Calories
168        75    125       150     330.4
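
Both `head()` and `tail()` take the number of rows as an optional argument and return 5 rows when it is omitted. A short sketch on a made-up 20-row frame:

```python
import pandas as pd

# Made-up frame with 20 rows, numbered 0 to 19
df = pd.DataFrame({"n": range(20)})

print(len(df.head()))  # no argument: the default number of rows
print(len(df.tail()))  # the same default applies to tail
print(df.tail(3))      # the last three rows
```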


## Info About the Data

In [47]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
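
The trailing `None` in the output above appears because `df.info()` prints its report itself and returns `None`, which `print()` then displays. The report also shows that `Calories` has 164 non-null values out of 169 entries, so five cells are empty. Missing cells can also be counted directly, sketched here on a small made-up frame:

```python
import pandas as pd

# Made-up data with one missing Calories value
df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Calories": [409.1, None, 282.4],
})

# isna() marks empty cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```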


## Remove null Rows

Remove rows that contain empty cells.

In [48]:
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[164 rows x 4 columns]


If you want to change the original DataFrame, use the `inplace = True` argument:

In [49]:
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[164 rows x 4 columns]
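
The difference between the two calls can be seen side by side. This is a sketch on a small made-up frame, not on `data.csv`.

```python
import pandas as pd

# Made-up data with one missing Calories value
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 195.1],
})

cleaned = df.dropna()             # returns a new DataFrame
print(len(df), len(cleaned))      # the original still has all its rows

result = df.dropna(inplace=True)  # modifies df itself and returns None
print(result, len(df))
```

In both cases the remaining rows keep their original index labels, which is why the 164-row output above still ends at index 168.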


Remove rows with a NULL value in the "Duration" column:

In [51]:
df.dropna(subset=['Duration'], inplace = True)
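
To illustrate what `subset` changes, here is a sketch on made-up data in which one row is missing `Duration` and another is missing `Calories`. Only the first of these is removed.

```python
import pandas as pd

# Made-up data: row 1 lacks Duration, row 2 lacks Calories
df = pd.DataFrame({
    "Duration": [60, None, 30],
    "Calories": [409.1, 300.0, None],
})

# Only a missing Duration causes a row to be dropped
df.dropna(subset=["Duration"], inplace=True)
print(df)
```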

## Replace Empty Values

In [52]:
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)

## Replace Only For Specified Columns

In [53]:
df = pd.read_csv('data.csv')
df.fillna({"Calories": 130}, inplace=True)
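
The two forms of `fillna` can be compared on a small made-up frame: a single value fills every empty cell, while a dictionary fills only the columns it names.

```python
import pandas as pd

# Made-up data with one empty cell in each column
df = pd.DataFrame({
    "Pulse": [110.0, None, 103.0],
    "Calories": [409.1, 479.0, None],
})

everywhere = df.fillna(130)                   # fills every empty cell
only_calories = df.fillna({"Calories": 130})  # leaves Pulse untouched

print(everywhere)
print(only_calories)
```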

## Replace Using Mean, Median, or Mode

In [54]:
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
# x = df["Calories"].median()
# x = df["Calories"].mode()[0]
df.fillna({"Calories": x}, inplace=True)
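
The three statistics can give quite different fill values. The sketch below uses a short made-up Series; all three methods skip the missing entry by default, and `mode()` returns a Series, which is why `[0]` is needed to pick the first (most frequent) value.

```python
import pandas as pd

# Made-up values; the last entry is missing
calories = pd.Series([300.0, 300.0, 330.0, 450.0, None])

print(calories.mean())     # average of the non-missing values
print(calories.median())   # middle value of the sorted non-missing values
print(calories.mode()[0])  # most frequent value

# Fill the gap with the median, which is less sensitive to outliers
filled = calories.fillna(calories.median())
print(filled)
```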

## Replacing Values

In [55]:
df.loc[7, 'Duration'] = 45

In [56]:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
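
The loop above works, but the same rule can be applied to the whole column at once. This is a sketch of two alternatives on made-up data: a boolean mask with `loc`, and `Series.clip`.

```python
import pandas as pd

# Made-up durations, two of them above the 120 limit
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Boolean mask: select rows where Duration exceeds 120 and overwrite them
df.loc[df["Duration"] > 120, "Duration"] = 120
print(df)

# Series.clip(upper=...) is another way to cap values
capped = pd.Series([60, 450, 45, 130]).clip(upper=120)
print(capped)
```

On large data sets, column-wise operations like these are usually much faster than a Python loop over the index.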

## Discovering Duplicates

In [57]:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
164    False
165    False
166    False
167    False
168    False
Length: 169, dtype: bool
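
`duplicated()` returns `True` for every row that repeats an earlier row in all columns. A sketch on a made-up frame that does contain a repeated row:

```python
import pandas as pd

# Made-up data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 109, 110, 115],
})

flags = df.duplicated()
print(flags)
print(flags.sum())  # number of duplicate rows
```

The first occurrence is not flagged, only the later repeats. Row 3 shares its `Duration` with row 0 but differs in `Pulse`, so it is not a duplicate.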


## Removing Duplicates

In [58]:
df.drop_duplicates(inplace = True)
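
By default `drop_duplicates()` keeps the first occurrence of each repeated row; the `keep` parameter changes which copy survives. A sketch on made-up data:

```python
import pandas as pd

# Made-up data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 109, 110, 115],
})

first_kept = df.drop_duplicates()            # keeps row 0, drops row 2
last_kept = df.drop_duplicates(keep="last")  # keeps row 2, drops row 0

print(first_kept)
print(last_kept)
```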

## Finding Relationships

In [59]:
df.corr()

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.245729 -0.081582  0.818958
Pulse    -0.245729  1.000000  0.787035  0.015408
Maxpulse -0.081582  0.787035  1.000000  0.194031
Calories  0.818958  0.015408  0.194031  1.000000
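
`corr()` computes the pairwise correlation between numeric columns (Pearson correlation by default). Values range from -1 to 1: in the table above, `Duration` and `Calories` (0.82) are strongly related, while `Pulse` and `Calories` (0.015) are almost unrelated. The extremes can be seen in a sketch with made-up columns:

```python
import pandas as pd

# Made-up columns with exact linear relationships to x
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],         # rises exactly as x rises
    "negative_x": [-1, -2, -3, -4],   # falls exactly as x rises
})

corr = df.corr()
print(corr)
```

A perfect positive relationship gives a value of 1, a perfect negative one gives -1, and every column has a correlation of 1 with itself.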
