---

# Ioannou_Georgios


## Copyright © 2023 by Georgios Ioannou


---

<h1 align="center"> Pandas </h1>
<h2 align="left"> Table of Contents </h2>
<ol>
  <li> <i style="color:red"> Indexing and Selection: </i> How to select and slice data using methods like <i style="color:red"> .loc[], .iloc[], and boolean indexing.</i> </li>

  <li> <i style="color:red"> Data Cleaning: </i> Handling missing data <i style="color:red">(isna(), dropna(), fillna()), duplicate records (duplicated(), drop_duplicates()),</i> and dealing with inconsistent data. </li>
</ol>


---

<h2 align="center"> Libraries </h2>


In [1]:
# Import libraries.

import numpy as np  # numpy for numerical arrays and the np.nan missing-value marker.
import pandas as pd  # pandas for data handling/wrangling.

---

<h2 align="center"> Main Code </h2>


---

<h3 align="center" style="color:red"> Indexing and Selection </h3>


In [2]:
# Our dataframe as a Python dictionary.
# keys = column names
# values = the entries of each column (one entry per row)

# data = {
#     "Quantity": [1, 2, 3, 4, 5],
#     "Price": [10, 20, 30, 40, 50],
#     "Item": ["apple", "banana", "cherry", "date", "elderberry"],
# }


data = {
    "Quantity": [1, 2, np.nan, 4, 5, 2, 2, 4],
    "Price": [10, np.nan, 30, 40, 50, 60, 60, 40],
    "Item": [
        "apple",
        "cucumber",
        np.nan,
        "celery",
        "elderberry",
        "pepper",
        "pepper",
        "celery",
    ],
    "Category": [
        "Fruit",
        "Veg",
        "fruit",
        "Vegetable",
        "Fruit",
        "Veg",
        "Veg",
        "Vegetable",
    ],
}

# Convert the previous Python dictionary to a Pandas DataFrame. Use string row labels as the index for easier illustration.

df = pd.DataFrame(
    data, index=["row1", "row2", "row3", "row4", "row5", "row6", "row7", "row8"]
)

In [3]:
# Print/Display/Output the Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


---

<h4 align="center"> <u> loc </u> </h4>
<h4 align="center"> <i style="color:red"> label-based </i> indexing </h4>


In [4]:
# Select a single row by label.

df.loc["row2"]

Quantity         2.0
Price            NaN
Item        cucumber
Category         Veg
Name: row2, dtype: object

In [5]:
# Select a single element (one cell) by row label and column label.

df.loc["row3", "Price"]

30.0

In [6]:
# Select multiple rows by label.

df.loc[["row2", "row4"]]

Unnamed: 0,Quantity,Price,Item,Category
row2,2.0,,cucumber,Veg
row4,4.0,40.0,celery,Vegetable


In [7]:
# Select specific rows and columns by label.

df.loc[["row1", "row3"], ["Quantity", "Item"]]

Unnamed: 0,Quantity,Item
row1,1.0,apple
row3,,


---

<h4 align="center"> <u> iloc </u> </h4>
<h4 align="center"> <i style="color:red"> integer-based </i> indexing </h4>


In [8]:
# Print/Display/Output the Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


In [9]:
# Select a single row by integer location.

# First row.

print(df.iloc[0])

# Second row.

df.iloc[1]

Quantity      1.0
Price        10.0
Item        apple
Category    Fruit
Name: row1, dtype: object


Quantity         2.0
Price            NaN
Item        cucumber
Category         Veg
Name: row2, dtype: object

In [10]:
# Select a single element (one cell) by integer location: row position 2, column position 1.

df.iloc[2, 1]

30.0

In [11]:
# Select multiple rows by integer location.

df.iloc[[1, 3]]

Unnamed: 0,Quantity,Price,Item,Category
row2,2.0,,cucumber,Veg
row4,4.0,40.0,celery,Vegetable


In [12]:
# Select multiple rows and columns by integer location.

df.iloc[[0, 2], [0, 2]]

Unnamed: 0,Quantity,Item
row1,1.0,apple
row3,,
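
The cells above select individual labels and positions. Slicing works with both indexers too, with one difference that is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. The sketch below uses a small stand-in frame named `df_demo` (a hypothetical name and hypothetical data, not the notebook's `df`).

```python
import pandas as pd

# Hypothetical stand-in frame with string row labels.
df_demo = pd.DataFrame(
    {"Quantity": [1, 2, 3, 4], "Price": [10, 20, 30, 40]},
    index=["row1", "row2", "row3", "row4"],
)

# .loc slices by label and INCLUDES the end label ("row3").
by_label = df_demo.loc["row1":"row3"]

# .iloc slices by position and EXCLUDES the end position (3).
by_position = df_demo.iloc[0:3]

print(list(by_label.index))     # ['row1', 'row2', 'row3']
print(list(by_position.index))  # ['row1', 'row2', 'row3']
```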


---

<h4 align="center"> <u> Boolean Indexing </u> </h4>
<h4 align="center"> filter rows based on a condition </h4>


In [13]:
# Select rows where column 'Quantity' is greater than 3.

condition = df["Quantity"] > 3
df[condition]

Unnamed: 0,Quantity,Price,Item,Category
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row8,4.0,40.0,celery,Vegetable


In [None]:
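# Select rows where column 'Item' contains the character 'a'.
# This fails because 'Item' has a missing value, so the mask contains NaN instead of True/False.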

condition = df["Item"].str.contains("a")
df[condition]


### ValueError: Cannot mask with non-boolean array containing NA / NaN values ###
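
One way to avoid this error without dropping rows is the `na` parameter of `.str.contains()`, which sets the mask value used for missing entries. The sketch below rebuilds the `Item` column as a standalone Series named `items` (a hypothetical name) and treats missing items as non-matches.

```python
import numpy as np
import pandas as pd

# Same values as the "Item" column above, rebuilt here so the example is self-contained.
items = pd.Series(
    ["apple", "cucumber", np.nan, "celery", "elderberry", "pepper", "pepper", "celery"],
    index=["row1", "row2", "row3", "row4", "row5", "row6", "row7", "row8"],
)

# na=False makes missing entries count as "does not contain", so the mask is purely boolean.
mask = items.str.contains("a", na=False)

print(items[mask])  # only row1 ("apple") contains the character 'a'
```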

In [14]:
# Combining multiple conditions.

condition1 = df["Quantity"] > 2
condition2 = df["Price"] < 40
df[condition1 & condition2]  # Using '&' for 'and'

Unnamed: 0,Quantity,Price,Item,Category


In [15]:
# Combining multiple conditions.

condition1 = df["Quantity"] > 2
condition2 = df["Price"] < 41
df[condition1 & condition2]  # Using '&' for 'and'

Unnamed: 0,Quantity,Price,Item,Category
row4,4.0,40.0,celery,Vegetable
row8,4.0,40.0,celery,Vegetable
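
Besides `&`, masks can be combined with `|` (or) and inverted with `~` (not). When the comparisons are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>` and `<`. The sketch below uses a hypothetical stand-in frame `df_demo`. Note that comparisons against NaN evaluate to False, so negating a mask brings rows with missing values back in.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with some missing values.
df_demo = pd.DataFrame(
    {"Quantity": [1, 2, np.nan, 4, 5], "Price": [10, np.nan, 30, 40, 50]},
    index=["row1", "row2", "row3", "row4", "row5"],
)

# '|' for 'or': each comparison is wrapped in parentheses.
either = df_demo[(df_demo["Quantity"] > 3) | (df_demo["Price"] < 20)]

# '~' for 'not': row2 has a NaN price, so (Price < 20) is False there and its negation is True.
not_cheap = df_demo[~(df_demo["Price"] < 20)]

print(list(either.index))     # ['row1', 'row4', 'row5']
print(list(not_cheap.index))  # ['row2', 'row3', 'row4', 'row5']
```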


---

<h3 align="center" style="color:red"> Data Cleaning </h3>


---

<h4 align="center"> <u> Handling Missing Data </u> </h4>


In [16]:
# Print/Display/Output the Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


---

<h4 align="left"> Identify Missing Values </h4>


In [17]:
df.isna()

Unnamed: 0,Quantity,Price,Item,Category
row1,False,False,False,False
row2,False,True,False,False
row3,True,False,True,False
row4,False,False,False,False
row5,False,False,False,False
row6,False,False,False,False
row7,False,False,False,False
row8,False,False,False,False
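
The full True/False table becomes hard to read on larger data. A common follow-up, sketched below on a hypothetical stand-in frame `df_demo`, is to count missing values per column with `.isna().sum()` and to list only the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame: one missing value in each column.
df_demo = pd.DataFrame(
    {
        "Quantity": [1, 2, np.nan],
        "Price": [10, np.nan, 30],
        "Item": ["apple", "cucumber", np.nan],
    },
    index=["row1", "row2", "row3"],
)

# True counts as 1, so summing the mask gives the number of missing values per column.
missing_per_column = df_demo.isna().sum()

# any(axis=1) is True for a row if any of its columns is missing.
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]

print(missing_per_column)
print(list(rows_with_missing.index))  # ['row2', 'row3']
```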


---

<h4 align="left"> Drop Rows with Missing Values </h4>


In [18]:
# Drop rows.

df_tmp = df.dropna()

# Print/Display/Output the Pandas DataFrame.

df_tmp

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable
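
Calling `dropna()` with no arguments removes every row that has any missing value, which can discard more data than intended. The `subset` and `thresh` parameters narrow this down. A minimal sketch on a hypothetical stand-in frame `df_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame.
df_demo = pd.DataFrame(
    {
        "Quantity": [1, 2, np.nan],
        "Price": [10, np.nan, 30],
        "Item": ["apple", "cucumber", np.nan],
    },
    index=["row1", "row2", "row3"],
)

# Drop a row only if "Price" is missing; missing values in other columns are tolerated.
price_known = df_demo.dropna(subset=["Price"])

# Keep rows that have at least 2 non-missing values.
at_least_two = df_demo.dropna(thresh=2)

print(list(price_known.index))   # ['row1', 'row3']
print(list(at_least_two.index))  # ['row1', 'row2']
```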


In [19]:
# Previously: ### ValueError: Cannot mask with non-boolean array containing NA / NaN values ###

# Select rows where column 'Item' contains the character 'a'.

condition = df_tmp["Item"].str.contains("a")
df_tmp[condition]

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit


---

<h4 align="left"> Fill Missing Values </h4>


In [20]:
# Filling missing values with 0.

df_tmp = df.fillna(0)

# Print/Display/Output the Pandas DataFrame.

df_tmp

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,0.0,cucumber,Veg
row3,0.0,30.0,0,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable
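
In the output above, `fillna(0)` also puts the number 0 into the text column `Item`. Passing a dictionary to `fillna()` sets a different fill value per column instead. The sketch below uses a hypothetical stand-in frame `df_demo`; the fill values (0, the column mean, and `"unknown"`) are illustrative assumptions, not a recommendation for real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame.
df_demo = pd.DataFrame(
    {
        "Quantity": [1, 2, np.nan],
        "Price": [10, np.nan, 30],
        "Item": ["apple", "cucumber", np.nan],
    },
    index=["row1", "row2", "row3"],
)

# One fill value per column; mean() skips NaN, so the mean of Price is (10 + 30) / 2.
fill_values = {
    "Quantity": 0,
    "Price": df_demo["Price"].mean(),
    "Item": "unknown",
}
filled = df_demo.fillna(fill_values)

print(filled)
```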


---

<h4 align="center"> <u> Handling Duplicate Records </u> </h4>


In [21]:
# Print/Display/Output the Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


---

<h4 align="left"> Identify Duplicate Rows </h4>


In [22]:
# Identify duplicate rows. By default, the first occurrence is marked False and later repeats are marked True.

df.duplicated()

row1    False
row2    False
row3    False
row4    False
row5    False
row6    False
row7     True
row8     True
dtype: bool
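
By default `duplicated()` flags only the later repeats. Passing `keep=False` flags every member of a duplicate group, which helps when inspecting them side by side. A small sketch on a hypothetical stand-in frame `df_demo`:

```python
import pandas as pd

# Hypothetical stand-in frame: row7 repeats row6, and row8 repeats row4.
df_demo = pd.DataFrame(
    {"Quantity": [4, 2, 2, 4], "Item": ["celery", "pepper", "pepper", "celery"]},
    index=["row4", "row6", "row7", "row8"],
)

# Default keep="first": only the later repeats are flagged.
repeats_only = df_demo.duplicated()

# keep=False: every row that has a twin is flagged.
all_copies = df_demo.duplicated(keep=False)

print(int(repeats_only.sum()))  # 2
print(all_copies.tolist())      # [True, True, True, True]
```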

---

<h4 align="left"> Drop Duplicate Rows </h4>


In [23]:
# Drop duplicate rows.

df.drop_duplicates()

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
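
`drop_duplicates()` compares whole rows by default. When only some columns define what counts as a duplicate, the `subset` parameter restricts the comparison, and `keep` chooses which occurrence survives. The sketch below uses a hypothetical stand-in frame `df_demo` with made-up prices.

```python
import pandas as pd

# Hypothetical stand-in frame: the two "pepper" rows differ in price.
df_demo = pd.DataFrame(
    {"Item": ["celery", "pepper", "pepper", "celery"], "Price": [40, 60, 65, 40]},
    index=["row4", "row6", "row7", "row8"],
)

# Treat rows with the same "Item" as duplicates and keep the last occurrence of each.
last_per_item = df_demo.drop_duplicates(subset=["Item"], keep="last")

print(list(last_per_item.index))  # ['row7', 'row8']
```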


---

<h4 align="center"> <u> Dealing With Inconsistent Data </u> </h4>


In [24]:
# Print/Display/Output the Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


---

<h4 align="left"> Convert Text to Lowercase/Uppercase </h4>


---

<h3 align="center"> <i style="color:red"> NEVER </i> </h3>
<h3 align="center"> df_tmp = df </h3>
<h3 align="center"> to make a temporary copy of the Pandas DataFrame </h3>
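
The reason for this warning is that plain assignment does not copy anything: both names refer to the same DataFrame object, so a change made through one name shows up through the other. A short sketch with hypothetical names (`original`, `alias`, `independent`):

```python
import pandas as pd

original = pd.DataFrame({"Category": ["Fruit", "Veg"]})

alias = original               # NOT a copy: a second name for the same object.
independent = original.copy()  # A real, independent copy.

# Modify the DataFrame through the alias.
alias.loc[0, "Category"] = "CHANGED"

print(alias is original)                 # True
print(original.loc[0, "Category"])       # CHANGED
print(independent.loc[0, "Category"])    # Fruit
```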


In [25]:
# To make text data consistent, you can convert everything to lowercase using .str.lower().
# OR
# To make text data consistent, you can convert everything to uppercase using .str.upper().

# Make a copy of the Pandas DataFrame.

df_tmp = df.copy()

# Convert all values in the "Category" column of the Pandas DataFrame to lowercase.
df_tmp["Category"] = df_tmp["Category"].str.lower()

# Print/Display/Output the Pandas DataFrame.

df_tmp

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,fruit
row2,2.0,,cucumber,veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,vegetable
row5,5.0,50.0,elderberry,fruit
row6,2.0,60.0,pepper,veg
row7,2.0,60.0,pepper,veg
row8,4.0,40.0,celery,vegetable


---

<h4 align="left"> Replace Inconsistent Values </h4>


In [26]:
# Replace inconsistent values with a consistent value using .replace().

# Make a copy of the Pandas DataFrame.

df_tmp = df.copy()

df_tmp["Category"] = df_tmp["Category"].replace({"Fruit": "fruit"})
df_tmp["Category"] = df_tmp["Category"].replace({"Veg": "vegetable"})
df_tmp["Category"] = df_tmp["Category"].replace({"Vegetable": "vegetable"})

# Print/Display/Output the Pandas DataFrame.

df_tmp

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,fruit
row2,2.0,,cucumber,vegetable
row3,,30.0,,fruit
row4,4.0,40.0,celery,vegetable
row5,5.0,50.0,elderberry,fruit
row6,2.0,60.0,pepper,vegetable
row7,2.0,60.0,pepper,vegetable
row8,4.0,40.0,celery,vegetable
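
The three `.replace()` calls can also be written as a single call with one mapping dictionary. `.replace()` on a Series matches whole values by default, so the key `"Veg"` does not touch `"Vegetable"`. A sketch on a standalone Series named `category` (a hypothetical name):

```python
import pandas as pd

# Same kinds of inconsistent labels as in the "Category" column.
category = pd.Series(["Fruit", "Veg", "fruit", "Vegetable"])

# One dictionary holds every old value -> new value pair.
mapping = {"Fruit": "fruit", "Veg": "vegetable", "Vegetable": "vegetable"}
cleaned = category.replace(mapping)

print(cleaned.tolist())  # ['fruit', 'vegetable', 'fruit', 'vegetable']
```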


In [27]:
# Print/Display/Output the Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


---

<h3 align="center" style="color:red"> Pipeline Function </h3>


In [28]:
# pipeline() function.
# Input: Pandas DataFrame
# Output: Pandas DataFrame with rows that have missing values dropped, duplicate rows dropped, and the "Category" column converted to lowercase and reduced to 2 classes (fruit and vegetable).


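def pipeline(df):
    # Drop rows with missing values, then drop duplicate rows.
    # The explicit copy avoids a possible SettingWithCopyWarning on the assignment below
    # and guarantees the caller's DataFrame is left unchanged.
    df = df.dropna().drop_duplicates().copy()

    # Convert "Category" to lowercase, then map the remaining variant "veg" to "vegetable".
    # "fruit" and "vegetable" are already in their final form, so they need no replacement.
    df["Category"] = df["Category"].str.lower().replace({"veg": "vegetable"})
    return df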

In [29]:
# Print/Display/Output the original Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,Fruit
row2,2.0,,cucumber,Veg
row3,,30.0,,fruit
row4,4.0,40.0,celery,Vegetable
row5,5.0,50.0,elderberry,Fruit
row6,2.0,60.0,pepper,Veg
row7,2.0,60.0,pepper,Veg
row8,4.0,40.0,celery,Vegetable


In [30]:
df = pipeline(df)

# Print/Display/Output the processed Pandas DataFrame.

df

Unnamed: 0,Quantity,Price,Item,Category
row1,1.0,10.0,apple,fruit
row4,4.0,40.0,celery,vegetable
row5,5.0,50.0,elderberry,fruit
row6,2.0,60.0,pepper,vegetable


In [None]:
na_values = ["NO CLUE", "N/A", "0"]
requests = pd.read_csv(
    "../data/311-service-requests.csv", na_values=na_values, dtype={"Incident Zip": str}
)


# By default the following values are interpreted as NaN: "" (empty string), "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null".
# Values passed through na_values are treated as NaN in addition to these defaults.
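
The cell above depends on a CSV file that is not part of this notebook. To try the same `na_values` and `dtype` arguments without that file, the sketch below reads a tiny in-memory CSV through `io.StringIO`. The rows and the `Complaint` column are made up for illustration and are not taken from the 311 dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = (
    "Incident Zip,Complaint\n"
    "10001,Noise\n"
    "NO CLUE,Heating\n"
    "N/A,Noise\n"
    "00083,0\n"
)

na_values = ["NO CLUE", "N/A", "0"]
requests_demo = pd.read_csv(
    io.StringIO(csv_text),
    na_values=na_values,           # extra strings to treat as missing
    dtype={"Incident Zip": str},   # keep zip codes as text so leading zeros survive
)

print(requests_demo)
print(requests_demo.isna().sum())
```

Reading the zip code column as `str` keeps `"00083"` intact. The cell value `"0"` in `Complaint` becomes NaN because it appears in `na_values`, while the zeros inside `"00083"` are untouched because matching is done on whole cell values.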