## Pandas

#### Pandas is a Python library.

#### Pandas is used to analyze data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

### Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets and make them readable and relevant.

### What Can Pandas Do?
#### Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
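Each of the questions above maps onto a DataFrame or Series method. A minimal sketch, assuming a small made-up table (the column names `hours` and `score` and their values are invented for illustration):

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4],
    "score": [10, 20, 30, 40],
})

print(df.corr())            # pairwise correlation between numeric columns
print(df["score"].mean())   # average value
print(df["score"].max())    # largest value
print(df["score"].min())    # smallest value
```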

### Where is the Pandas Codebase?
https://github.com/pandas-dev/pandas

In [10]:
pip install pandas




In [12]:
import pandas

In [14]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Pandas as pd

In [1]:
import pandas as pd

In [50]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Pandas Version Checking

In [13]:
print(pd.__version__)

2.2.2


### What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [36]:
a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


### Labels


If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [39]:
print(myvar[0])

1


### Create Labels
With the 'index' argument, you can name your own labels.

In [3]:
import pandas as pd
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


In [7]:
print(myvar["x"])

1
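Once a Series has custom labels, it helps to be explicit about whether a lookup is by label or by position. A short sketch using the same values, with `.loc` for labels and `.iloc` for integer positions:

```python
import pandas as pd

s = pd.Series([1, 7, 2], index=["x", "y", "z"])

print(s.loc["y"])          # access by label
print(s.iloc[1])           # access by integer position
print(s.loc[["x", "z"]])   # a list of labels returns a Series
```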


### Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

In [15]:
calories = {1: 420.2, 2: 380, 3: 390}

myvar = pd.Series(calories)

print(myvar)

1    420.2
2    380.0
3    390.0
dtype: float64


To select only some of the items in the dictionary, use the "index" argument and specify only the items you want to include in the Series.

In [25]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64
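If the "index" argument names a label that is not a key in the dictionary, the Series still gets that label, but its value is missing (NaN). A small sketch; the label "day4" is a deliberately absent key chosen for illustration:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value is NaN.
s = pd.Series(calories, index=["day1", "day4"])

print(s)
```

Because NaN is a floating-point value, the resulting Series has dtype float64 rather than int64.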


### DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

In [15]:
data = {
  "calories": [420, 380, 390],
  "exercise_time": [50, 40, 45],
  "weight": [52, 51, 50]
}

myvar = pd.DataFrame(data)

print(myvar)

   calories  exercise_time  weight
0       420             50      52
1       380             40      51
2       390             45      50


In [17]:
type(myvar)

pandas.core.frame.DataFrame

In [29]:
import pandas as pd


data = {
  "calories": [420, 380, 390],
  "exercise_time": [50, 40, 45],
  "weight": [52, 51, "a"]
}

df = pd.DataFrame(data)

print(df)

   calories  exercise_time weight
0       420             50     52
1       380             40     51
2       390             45      a
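A column that mixes numbers and text is stored with the generic `object` dtype, which can be checked with `dtypes`. The sketch below also shows one possible cleanup with `pd.to_numeric`, under the assumption (not made in the notes above) that non-numeric entries should be treated as missing:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "weight": [52, 51, "a"],   # mixed integers and a string
})

print(df.dtypes)

# Assumption for this sketch: non-numeric entries become missing values (NaN).
weight_numeric = pd.to_numeric(df["weight"], errors="coerce")
print(weight_numeric)
```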


## New Lecture

### Locate Row

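Pandas uses the "loc" attribute to return one or more rows by label. The "iloc" attribute used below selects by integer position instead; with the default index (0, 1, 2, ...) both return the same row.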

In [23]:
print(df.iloc[0])

calories         420
exercise_time     50
weight            52
Name: 0, dtype: object


In [54]:
type(df.loc[0])

pandas.core.series.Series

A single label returns a Series.

Passing a list of labels (double brackets) selects several rows:

In [21]:
print(df.loc[[0,1, 2]])

   calories  exercise_time weight
0       420             50     52
1       380             40     51
2       390             45      a


In [37]:
type(df.loc[[0,1]])

pandas.core.frame.DataFrame

A list of labels returns a DataFrame, even when the list holds a single label.
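The difference between "loc" and "iloc" only becomes visible when the index labels are not 0, 1, 2, ... A minimal sketch, with index labels 10, 20, 30 chosen purely for illustration:

```python
import pandas as pd

# Labels chosen so that labels and positions differ.
df = pd.DataFrame({"calories": [420, 380, 390]}, index=[10, 20, 30])

print(df.loc[10])     # row whose label is 10 -> Series
print(df.iloc[0])     # first row by position -> Series
print(df.loc[[10]])   # list of labels -> DataFrame with one row
```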

## Named Indexes
With the "index" argument, you can name your own indexes.

In [55]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["", "day2", "day3"])

print(df) 

      calories  duration
           420        50
day2       380        40
day3       390        45


## Locate Named Indexes
Use the named index in the "loc" attribute to return the specified row(s).

In [47]:
print(df.loc["day2"])
# print(type(df.loc[["day2"]]))
# print(type(df.loc["day2"]))

calories    380
duration     40
Name: day2, dtype: int64


In [57]:
print(df.loc[["","day2", "day3"]])

      calories  duration
           420        50
day2       380        40
day3       390        45


## Reading and Loading CSV Files into a DataFrame

In [65]:
path='C:/Users/Ali/Desktop/AI/Datasets/car_price_dataset.csv'
path2='C:/Users/Ali/Desktop/AI/Datasets/csvfile.txt'
df = pd.read_csv(path)
# df2= pd.read_csv(path2)
print(type(df))
print("________________________________________________________")
print(df)
# print(df2)

<class 'pandas.core.frame.DataFrame'>
________________________________________________________
           Brand     Model  Year  Engine_Size Fuel_Type    Transmission  \
0            Kia       Rio  2020          4.2    Diesel          Manual   
1      Chevrolet    Malibu  2012          2.0    Hybrid       Automatic   
2       Mercedes       GLA  2020          4.2    Diesel       Automatic   
3           Audi        Q5  2023          2.0  Electric          Manual   
4     Volkswagen      Golf  2003          2.6    Hybrid  Semi-Automatic   
...          ...       ...   ...          ...       ...             ...   
9995         Kia    Optima  2004          3.7    Diesel  Semi-Automatic   
9996   Chevrolet    Impala  2002          1.4  Electric       Automatic   
9997         BMW  3 Series  2010          3.0    Petrol       Automatic   
9998        Ford  Explorer  2002          1.4    Hybrid       Automatic   
9999  Volkswagen    Tiguan  2001          2.1    Diesel          Manual   
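To try `read_csv` without a file on disk, the CSV text can be wrapped in `io.StringIO`. This is only a sketch: the three-column text below is a tiny stand-in, not the real dataset, and `usecols` is shown as one optional way to load a subset of columns.

```python
import io
import pandas as pd

# A tiny CSV text standing in for a file on disk.
csv_text = """Brand,Model,Year
Kia,Rio,2020
Audi,Q5,2023
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# usecols limits which columns are loaded.
small = pd.read_csv(io.StringIO(csv_text), usecols=["Brand", "Year"])
print(small.columns.tolist())
```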

    

In [67]:
df

           Brand     Model  Year  Engine_Size Fuel_Type    Transmission  Mileage  Doors  Owner_Count  Price
0            Kia       Rio  2020          4.2    Diesel          Manual   289944      3            5   8501
1      Chevrolet    Malibu  2012          2.0    Hybrid       Automatic     5356      2            3  12092
2       Mercedes       GLA  2020          4.2    Diesel       Automatic   231440      4            2  11171
3           Audi        Q5  2023          2.0  Electric          Manual   160971      2            1  11780
4     Volkswagen      Golf  2003          2.6    Hybrid  Semi-Automatic   286618      3            3   2867
...          ...       ...   ...          ...       ...             ...      ...    ...          ...    ...
9995         Kia    Optima  2004          3.7    Diesel  Semi-Automatic     5794      2            4   8884
9996   Chevrolet    Impala  2002          1.4  Electric       Automatic   168000      2            1   6240
9997         BMW  3 Series  2010          3.0    Petrol       Automatic    86664      5            1   9866
9998        Ford  Explorer  2002          1.4    Hybrid       Automatic   225772      4            1   4084


Use "to_string()" to print an entire DataFrame without truncation. Here it is applied to the first five rows only.

In [71]:
print(df.head().to_string())

        Brand   Model  Year  Engine_Size Fuel_Type    Transmission  Mileage  Doors  Owner_Count  Price
0         Kia     Rio  2020          4.2    Diesel          Manual   289944      3            5   8501
1   Chevrolet  Malibu  2012          2.0    Hybrid       Automatic     5356      2            3  12092
2    Mercedes     GLA  2020          4.2    Diesel       Automatic   231440      4            2  11171
3        Audi      Q5  2023          2.0  Electric          Manual   160971      2            1  11780
4  Volkswagen    Golf  2003          2.6    Hybrid  Semi-Automatic   286618      3            3   2867


## Printing the Head

In [77]:
df.head(1)

  Brand Model  Year  Engine_Size Fuel_Type Transmission  Mileage  Doors  Owner_Count  Price
0   Kia   Rio  2020          4.2    Diesel       Manual   289944      3            5   8501


## Printing the Tail

In [83]:
df.tail(1)

           Brand   Model  Year  Engine_Size Fuel_Type Transmission  Mileage  Doors  Owner_Count  Price
9999  Volkswagen  Tiguan  2001          2.1    Diesel       Manual   157882      3            3   3342
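Both `head()` and `tail()` take an optional row count and default to 5 rows. A short sketch on a made-up ten-row frame (the column name `n` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

print(df.head())     # first 5 rows by default
print(df.tail(3))    # last 3 rows
print(df.head(-2))   # all rows except the last 2
```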


## Checking Shape of Dataset

In [85]:
df.shape

(10000, 10)
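`shape` is a plain tuple of (rows, columns), so it can be unpacked directly. A small sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape
print(n_rows, n_cols)

print(len(df))               # number of rows
print(df.columns.tolist())   # column names
```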

## Maximum Rows Shown When Displaying a DataFrame

In [87]:
print(pd.options.display.max_rows) 

60


## Increase the maximum number of rows to display the entire DataFrame

In [91]:
pd.options.display.max_rows = 100000

df = pd.read_csv(path)

print(df.head()) 

        Brand   Model  Year  Engine_Size Fuel_Type    Transmission  Mileage  \
0         Kia     Rio  2020          4.2    Diesel          Manual   289944   
1   Chevrolet  Malibu  2012          2.0    Hybrid       Automatic     5356   
2    Mercedes     GLA  2020          4.2    Diesel       Automatic   231440   
3        Audi      Q5  2023          2.0  Electric          Manual   160971   
4  Volkswagen    Golf  2003          2.6    Hybrid  Semi-Automatic   286618   

   Doors  Owner_Count  Price  
0      3            5   8501  
1      2            3  12092  
2      4            2  11171  
3      2            1  11780  
4      3            3   2867  
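Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. When the change should only apply to one display, `pd.option_context` is an alternative worth knowing; the sketch below uses an invented 100-row frame and a limit of 10 purely as an example.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# The setting applies only inside the with-block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(df)   # display is truncated to fit the temporary limit

after = pd.get_option("display.max_rows")
print(before, inside, after)
```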


In [105]:
doors_column = df['Doors']
print(doors_column.head())

0    3
1    2
2    4
3    2
4    3
Name: Doors, dtype: int64


In [107]:
type(doors_column)

pandas.core.series.Series

## Accessing Multiple Rows or Columns

In [111]:
subset = df.loc[0:2, ['Brand', 'Model','Doors']]
print(subset)

       Brand   Model  Doors
0        Kia     Rio      3
1  Chevrolet  Malibu      2
2   Mercedes     GLA      4
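Note that a "loc" slice includes its end label, which is why `0:2` above returns three rows. An "iloc" slice follows normal Python slicing and excludes the end position. A sketch on a four-row excerpt of the same columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Kia", "Chevrolet", "Mercedes", "Audi"],
    "Doors": [3, 2, 4, 2],
})

by_label = df.loc[0:2, ["Brand"]]    # labels 0, 1 and 2 (end included)
by_position = df.iloc[0:2, [0]]      # positions 0 and 1 (end excluded)

print(by_label)
print(by_position)
```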


## Accessing Rows Based on Conditions

In [123]:
filtered_data = df[df['Year'] == 2010]
filter_2 = df[df['Model'] == 'Civic']
print(filter_2.head().to_string())

    Brand  Model  Year  Engine_Size Fuel_Type    Transmission  Mileage  Doors  Owner_Count  Price
6   Honda  Civic  2010          3.4  Electric       Automatic   139584      3            1  11208
42  Honda  Civic  2000          3.9    Hybrid          Manual    56020      2            4   7879
53  Honda  Civic  2015          2.6  Electric  Semi-Automatic    66777      5            1  11864
59  Honda  Civic  2005          2.4    Petrol       Automatic    60316      2            5   8293
83  Honda  Civic  2020          4.6    Petrol          Manual   191397      3            4  10872
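Several conditions can be combined into one filter. The sketch below uses a small invented frame; the threshold year 2005 is arbitrary. `query()` is shown as an equivalent string-based form.

```python
import pandas as pd

df = pd.DataFrame({
    "Model": ["Civic", "Civic", "Rio", "Golf"],
    "Year": [2010, 2000, 2020, 2003],
})

# Each condition needs its own parentheses; & means "and", | means "or".
civic_after_2005 = df[(df["Model"] == "Civic") & (df["Year"] > 2005)]
print(civic_after_2005)

# query() expresses the same filter as a string.
same = df.query("Model == 'Civic' and Year > 2005")
print(same)
```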


## Accessing Specific Cells with at and iat

In [127]:
model_at_index_6 = df.at[6, 'Model']  # row label 6, column 'Model'
print(model_at_index_6)

Civic
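`at` looks a cell up by row and column labels, while `iat` uses integer positions. A minimal sketch; the two-row frame and its index labels (6 and 7) are made up so that labels and positions differ:

```python
import pandas as pd

df = pd.DataFrame(
    {"Brand": ["Honda", "Kia"], "Model": ["Civic", "Rio"]},
    index=[6, 7],   # labels chosen so they differ from positions
)

print(df.at[6, "Model"])   # row label 6, column label "Model"
print(df.iat[0, 1])        # row position 0, column position 1
```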


In [1]:
import pandas as pd

## New Lecture

In [17]:
df = pd.DataFrame([['1993', 'Avi', 5, 41, 70, 'Bob'],  
                   ['1994', 'Cathy', 10, 1, 22, 'Cathy'],  
                   ['1995', 'Cathy', 24, 11, 44, 'Bob'],  
                   ['1996', 'Bob', 2, 11, 10, 'Avi'],  
                   ['1998', 'Avi', 20, 10, 40, 'Avi'], 
                   ['1999', 'Avi', 50, 8, 11, 'Cathy']], 
                  columns=('Patients', 'Name', 'Avi', 'Bob', 'Cathy', 'Aname')) 

df

  Patients   Name  Avi  Bob  Cathy  Aname
0     1993    Avi    5   41     70    Bob
1     1994  Cathy   10    1     22  Cathy
2     1995  Cathy   24   11     44    Bob
3     1996    Bob    2   11     10    Avi
4     1998    Avi   20   10     40    Avi
5     1999    Avi   50    8     11  Cathy


### Pandas DataFrame.pop()

In [19]:
popped_col = df.pop('Avi')  # removes the column from df and returns it

In [21]:
popped_col

0     5
1    10
2    24
3     2
4    20
5    50
Name: Avi, dtype: int64
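`pop()` changes the DataFrame in place: the column is returned as a Series and is no longer part of the frame. A short sketch on a two-row excerpt:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avi", "Bob"], "Avi": [5, 2]})

popped = df.pop("Avi")       # returns the column as a Series

print(popped.tolist())
print(df.columns.tolist())   # "Avi" is no longer in df
```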

## Drop()

In [23]:
df

  Patients   Name  Bob  Cathy  Aname
0     1993    Avi   41     70    Bob
1     1994  Cathy    1     22  Cathy
2     1995  Cathy   11     44    Bob
3     1996    Bob   11     10    Avi
4     1998    Avi   10     40    Avi
5     1999    Avi    8     11  Cathy


In [25]:
df = df.drop(columns=["Bob"])  # drop() returns a new DataFrame without the column; unlike pop(), it does not hand back the removed column
df

  Patients   Name  Cathy  Aname
0     1993    Avi     70    Bob
1     1994  Cathy     22  Cathy
2     1995  Cathy     44    Bob
3     1996    Bob     10    Avi
4     1998    Avi     40    Avi
5     1999    Avi     11  Cathy
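By default `drop()` leaves the original DataFrame untouched and returns a modified copy, which is why the result is assigned back to `df` above. It can also remove rows by index label. A sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avi", "Bob", "Cathy"], "Bob": [41, 11, 1]})

without_col = df.drop(columns=["Bob"])   # new DataFrame; df is unchanged
without_rows = df.drop(index=[0, 2])     # drop rows by index label

print(df.columns.tolist())
print(without_col.columns.tolist())
print(without_rows)
```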


## Get()

In [31]:
df.get("Name") 

0      Avi
1    Cathy
2    Cathy
3      Bob
4      Avi
5      Avi
Name: Name, dtype: object

In [37]:
path='Datasets/car_price_dataset.csv'
data=pd.read_csv(path)
data.head()

        Brand   Model  Year  Engine_Size Fuel_Type    Transmission  Mileage  Doors  Owner_Count  Price
0         Kia     Rio  2020          4.2    Diesel          Manual   289944      3            5   8501
1   Chevrolet  Malibu  2012          2.0    Hybrid       Automatic     5356      2            3  12092
2    Mercedes     GLA  2020          4.2    Diesel       Automatic   231440      4            2  11171
3        Audi      Q5  2023          2.0  Electric          Manual   160971      2            1  11780
4  Volkswagen    Golf  2003          2.6    Hybrid  Semi-Automatic   286618      3            3   2867


In [45]:
x_train= data.get(['Brand','Model','Engine_Size'])
x_train.head()
y_train=data.get(['Price'])
# y_train.head()


In [33]:
d=df.get(["Patients", "Name","Cathy"])
d 

  Patients   Name  Cathy
0     1993    Avi     70
1     1994  Cathy     22
2     1995  Cathy     44
3     1996    Bob     10
4     1998    Avi     40
5     1999    Avi     11
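The practical difference between `get()` and square brackets is what happens when a column does not exist: `df["Age"]` raises a `KeyError`, while `get()` returns a default. A sketch in which the column name "Age" is a deliberately missing, hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avi", "Bob"]})

print(df.get("Name"))   # existing column -> Series
print(df.get("Age"))    # missing column -> None

# A custom default can be supplied as the second argument.
print(df.get("Age", default="no such column"))
```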


## Pandas DataFrame.isin()

In [47]:
df

  Patients   Name  Cathy  Aname
0     1993    Avi     70    Bob
1     1994  Cathy     22  Cathy
2     1995  Cathy     44    Bob
3     1996    Bob     10    Avi
4     1998    Avi     40    Avi
5     1999    Avi     11  Cathy


In [51]:
new = df["Name"].isin(["Avi"])
# print(new)
data1=df[new]
print(data1)

  Patients Name  Cathy  Aname
0     1993  Avi     70    Bob
4     1998  Avi     40    Avi
5     1999  Avi     11  Cathy


In [67]:
f1 = df["Name"].isin(["Avi", "Bob"])
f2 = df["Patients"].isin(["1993", "1996"])
f3 = df["Aname"].isin(["Cathy", "Bob"])
# & binds more tightly than |, so this is (f1 and f2) or f3
data = df[(f1 & f2) | f3]
print(data)

  Patients   Name  Cathy  Aname
0     1993    Avi     70    Bob
1     1994  Cathy     22  Cathy
2     1995  Cathy     44    Bob
3     1996    Bob     10    Avi
5     1999    Avi     11  Cathy
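A boolean mask from `isin()` can be inverted with `~` to keep the rows that do not match. A small sketch on a four-row excerpt of the "Name" column:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avi", "Cathy", "Bob", "Avi"]})

is_avi = df["Name"].isin(["Avi"])
not_avi = df[~is_avi]   # rows whose Name is not "Avi"

print(not_avi)
```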


## Pandas DataFrame.where()

In [122]:
import numpy as np

In [126]:
df = pd.DataFrame([['1999', 'Avi', 5, 41, 70, 'Bob'],  
                   [np.nan, 'Cathy', 10, 1, 22, 'Cathy'],  
                   ['1993', 'Cathy', 24, 11, 44, 'Bob'],  
                   ['1996', 'Bob', 2, 11, 10, 'Avi'],  
                   ['1992', 'Avi', 20, 10, 40, 'Avi'], 
                   ['1991', 'Avi', 50, 8, 11, 'Cathy']], 
                  columns=('Patients', 'Name', 'Avi', 'Bob', 'Cathy', 'Aname')) 

df

  Patients   Name  Avi  Bob  Cathy  Aname
0     1999    Avi    5   41     70    Bob
1      NaN  Cathy   10    1     22  Cathy
2     1993  Cathy   24   11     44    Bob
3     1996    Bob    2   11     10    Avi
4     1992    Avi   20   10     40    Avi
5     1991    Avi   50    8     11  Cathy


In [91]:
df.sort_values("Patients", inplace=True)

# Boolean filtering: keep only the rows where "Patients" equals "1996"
df1 = df[df["Patients"] == "1996"]

df1 


  Patients Name  Avi  Bob  Cathy Aname
3     1996  Bob    2   11     10   Avi


In [93]:
df

  Patients   Name  Avi  Bob  Cathy  Aname
5     1991    Avi   50    8     11  Cathy
4     1992    Avi   20   10     40    Avi
2     1993  Cathy   24   11     44    Bob
3     1996    Bob    2   11     10    Avi
1     1998  Cathy   10    1     22  Cathy
0     1999    Avi    5   41     70    Bob
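Boolean filtering as used above drops the rows that fail the condition. `DataFrame.where()` behaves differently: it keeps the shape of the frame and replaces the values that fail the condition. A minimal sketch on a numeric excerpt; the threshold 9 and the replacement value 0 are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Avi": [5, 10, 24], "Bob": [41, 1, 11]})

# Keep values greater than 9; everything else becomes NaN.
kept = df.where(df > 9)
print(kept)

# The second argument supplies a replacement instead of NaN.
filled = df.where(df > 9, 0)
print(filled)
```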


In [97]:
df.describe()

             Avi        Bob      Cathy
count   6.000000   6.000000   6.000000
mean   18.500000  13.666667  32.833333
std    17.615334  13.909230  23.120698
min     2.000000   1.000000  10.000000
25%     6.250000   8.500000  13.750000
50%    15.000000  10.500000  31.000000
75%    23.000000  11.000000  43.000000
max    50.000000  41.000000  70.000000


## Null Check

In [128]:
ser= df.get("Patients")

In [130]:
pd.isna(ser)

0    False
1     True
2    False
3    False
4    False
5    False
Name: Patients, dtype: bool

In [132]:
path="Datasets/car_price_dataset.csv"
dataa=pd.read_csv(path)

In [136]:
dataa.head()

        Brand   Model  Year  Engine_Size Fuel_Type    Transmission  Mileage  Doors  Owner_Count  Price
0         NaN     Rio  2020          4.2    Diesel          Manual   289944      3            5   8501
1   Chevrolet  Malibu  2012          2.0    Hybrid       Automatic     5356      2            3  12092
2    Mercedes     GLA  2020          4.2    Diesel       Automatic   231440      4            2  11171
3        Audi      Q5  2023          2.0  Electric          Manual   160971      2            1  11780
4  Volkswagen    Golf  2003          2.6    Hybrid  Semi-Automatic   286618      3            3   2867


In [172]:
d1=dataa['Brand']

In [176]:
d1.isna()

0        True
1       False
2       False
3       False
4       False
        ...  
9995    False
9996    False
9997    False
9998    False
9999    False
Name: Brand, Length: 10000, dtype: bool
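With 10,000 rows, a column of True/False values is hard to read, so the null check is usually followed by a count and some handling of the missing entries. A sketch on a three-row excerpt; the placeholder text "Unknown" is an assumption made for this example, not something the dataset prescribes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Brand": [np.nan, "Chevrolet", "Audi"],
    "Model": ["Rio", "Malibu", "Q5"],
})

print(df.isna().sum())   # number of missing values per column

# Replace missing Brand values with a placeholder (assumed for illustration).
filled = df.fillna({"Brand": "Unknown"})
print(filled)

# Or remove the rows where Brand is missing.
dropped = df.dropna(subset=["Brand"])
print(dropped)
```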