# Introducing Pandas

Pandas is a high-level data manipulation package built on top of NumPy. Its key data structures are the Series and the DataFrame.


## Series

A Series is a one-dimensional array with axis labels (an index).

In [4]:
# Importing libraries and packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [5]:
# Creating a series from a list

x = pd.Series([10,20,30,40,50])
x

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [6]:
# We can access different components separately

# Accessing index
x.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
# Accessing the values

x.values

array([10, 20, 30, 40, 50])

In [8]:
# Accessing the data type

x.dtype

dtype('int64')

In [9]:
# Creating a series with an Index

data = [450,650,870]
Sales = Series(data, index=["Don", "Mike", "Edwin"])

### Accessing values

In [13]:
# You can access values using the index name

print(Sales["Don"])

450


In [14]:
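# You can also access values by position; .iloc makes this explicit
# (plain Sales[0] on a label-indexed Series is deprecated)
print(Sales.iloc[0])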

450
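
Label-based and position-based access can be made explicit with `.loc` and `.iloc`. The following is a minimal sketch, not part of the original notebook; it rebuilds the sales data under the illustrative name `sales` so that it runs on its own.

```python
import pandas as pd

# Rebuild the example Series (the name `sales` is only for illustration)
sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

# .loc looks up by index label
by_label = sales.loc["Don"]

# .iloc looks up by integer position, regardless of the labels
by_position = sales.iloc[0]

# Slicing differs: .loc includes the end label, .iloc excludes the end position
label_slice = sales.loc["Don":"Mike"]   # Don and Mike
position_slice = sales.iloc[0:1]        # Don only

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using `.loc` and `.iloc` avoids ambiguity when the index itself contains integers.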


### Checking for conditions

In [15]:
# Comparing a Series with a value returns a boolean Series
Sales>500

Don      False
Mike      True
Edwin     True
dtype: bool

In [16]:
# A list of booleans selects the entries where the value is True
Sales[[False, True, True]]

Mike     650
Edwin    870
dtype: int64

In [24]:
# If we want to see values greater than 500

Sales[Sales>500]

Mike     650
Edwin    870
dtype: int64
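
Boolean filters can also be combined. The sketch below is an illustration rather than part of the original notebook; it assumes the same sales figures, rebuilt under the name `sales`.

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

# Combine conditions with & (and), | (or) and ~ (not).
# Each condition needs its own parentheses.
mid_range = sales[(sales > 500) & (sales < 800)]
extremes = sales[(sales < 500) | (sales > 800)]
not_high = sales[~(sales > 500)]

print(mid_range)
print(extremes)
print(not_high)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the operator forms are used.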

In [25]:
"Don" in Sales

True

In [26]:
"Sally" in Sales

False

In [28]:
# `in` checks the index labels, not the values.
# 450 is a value, not a label, so this returns False
450 in Sales

False
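
To test whether a value (rather than a label) is present, the check has to be run against the values. This is a small sketch under the assumption that the same sales data is available as `sales`.

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

# `in` on the Series itself looks at the index labels
label_present = "Don" in sales
value_in_labels = 450 in sales

# To look for a value, check the underlying values array...
value_present = 450 in sales.values

# ...or use isin(), which returns a boolean Series
value_present_isin = bool(sales.isin([450]).any())

print(label_present, value_in_labels, value_present, value_present_isin)
```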

### Working with dictionaries

In [44]:
# Converting a Series into a dictionary
sales_dict = Sales.to_dict()
sales_dict

{'Don': 450, 'Mike': 650, 'Edwin': 870}

In [31]:
# Converting a dictionary into a Series

sales_ser = Series(sales_dict)
sales_ser

Don      450
Mike     650
Edwin    870
dtype: int64

### Adding entries and working with Null Values

In [39]:
# We can create a new Series from an existing Series
# Index labels that are not in the existing Series are assigned NaN

new_sales = Series(Sales, index=["Don", "Mike", "Sally", "Edwin", "Lucy"])
new_sales

Don      450.0
Mike     650.0
Sally      NaN
Edwin    870.0
Lucy       NaN
dtype: float64

In [40]:
# np.isnan is NumPy's element-wise NaN test; without parentheses this only displays the function
np.isnan

<ufunc 'isnan'>

In [41]:
# pd.isnull returns True wherever a value is missing
pd.isnull(new_sales)

Don      False
Mike     False
Sally     True
Edwin    False
Lucy      True
dtype: bool
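
Once the missing entries are flagged, they can be counted or filtered out. The sketch below is illustrative: it rebuilds the data as `sales` and uses `reindex` to produce the same result as passing a new index to `Series`. The dtype becomes `float64` because NaN is a floating-point value.

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

# Labels that are not in `sales` receive NaN
new_sales = sales.reindex(["Don", "Mike", "Sally", "Edwin", "Lucy"])

# Count the missing entries
missing_count = int(new_sales.isna().sum())

# Keep only the entries that have a value (two equivalent ways)
present = new_sales[new_sales.notna()]
dropped = new_sales.dropna()

print(missing_count)
print(dropped)
```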

In [46]:
# Naming the index
Sales.index.name = "Sales Person"
Sales

Sales Person
Don      450
Mike     650
Edwin    870
dtype: int64

In [47]:
# Naming the Series itself
Sales.name = "Total TV Sales"
Sales

Sales Person
Don      450
Mike     650
Edwin    870
Name: Total TV Sales, dtype: int64

## DataFrames

DataFrames are two-dimensional, size-mutable, potentially heterogeneous tabular data structures. A DataFrame has two labeled axes: the rows and the columns.

In [20]:
# Creating a DataFrame from a list of lists
data = [["Adrian", 20], ["Bethany", 23], ["Chloe", 41]]

# When we create a DataFrame, we can specify the column names
df = pd.DataFrame(data, columns=["Name", "Age"])

print(df)

      Name  Age
0   Adrian   20
1  Bethany   23
2    Chloe   41
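
The properties in the definition above (two labeled axes, mixed column types, size-mutable) can be inspected directly. The following is a rough sketch using the same three rows; the `Adult` column is a hypothetical addition for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    [["Adrian", 20], ["Bethany", 23], ["Chloe", 41]],
    columns=["Name", "Age"],
)

print(df.shape)    # (number of rows, number of columns)
print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # each column has its own dtype

# Size-mutable: a new column can be added in place
df["Adult"] = df["Age"] >= 18
print(df)
```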


### Working with Dictionaries 

This section looks at how to create DataFrames from dictionaries.

In [19]:
# Creating a dictionary with data for the DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris


In [23]:
# Create a DataFrame from the dictionary with custom indexes
df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df)

      Name  Age      City
A    Alice   25  New York
B      Bob   30    London
C  Charlie   28     Paris


In [22]:
# Create a list of dictionaries with data for the DataFrame
data = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
        {'Name': 'Bob', 'Age': 30, 'City': 'London'},
        {'Name': 'Charlie', 'Age': 28, 'City': 'Paris'}]

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data)

print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris


In [24]:
# Create a pandas Series with data for the DataFrame
ages = pd.Series([25, 30, 28], index=['Alice', 'Bob', 'Charlie'])

# Create a DataFrame from the Series
df = pd.DataFrame({'Age': ages})

print(df)

         Age
Alice     25
Bob       30
Charlie   28


In [25]:
# Create a pandas Series with data for the new column
cities = pd.Series(['New York', 'London', 'Paris'], index=['Alice', 'Bob', 'Charlie'])

# Add the Series as a new column to the DataFrame
df['City'] = cities

print(df)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris
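
When a Series is assigned as a column, pandas matches values by index label, not by position. The sketch below illustrates this with a deliberately mismatched index; the extra label `Dana` and the city `Berlin` are made up for the example.

```python
import pandas as pd

ages = pd.Series([25, 30, 28], index=["Alice", "Bob", "Charlie"])
df = pd.DataFrame({"Age": ages})

# Different order, no entry for Charlie, and an extra label (Dana)
cities = pd.Series(
    ["London", "New York", "Berlin"],
    index=["Bob", "Alice", "Dana"],
)

# Values are aligned on the index: Charlie gets NaN, Dana is ignored
df["City"] = cities
print(df)
```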


In [31]:
# Shift the values down by one row (the index stays in place)
df_shifted = df.shift(1)

# Print the shifted DataFrame
print(df_shifted)

# The names are already the index, so there is no 'Name' column for set_index to use;
# rename_axis gives the existing index the name 'Name' instead
df_indexed = df.rename_axis('Name')

# Print the indexed DataFrame
print(df_indexed)

          Age      City
Alice     NaN      None
Bob      25.0  New York
Charlie  30.0    London
         Age      City
Name
Alice     25  New York
Bob       30    London
Charlie   28     Paris

In [32]:
# Fill all NaN values with 0 (df has no NaN values here, so the result is unchanged)
df_filled = df.fillna(0)

# Print the filled DataFrame
print(df_filled)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris
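
Because the DataFrame above has no missing values, the effect of `fillna` is not visible. The sketch below uses a small made-up DataFrame, `df_missing`, that does contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city
df_missing = pd.DataFrame(
    {"Age": [25, np.nan, 28], "City": ["New York", None, "Paris"]},
    index=["Alice", "Bob", "Charlie"],
)

# Fill a single numeric column with a constant
age_filled = df_missing["Age"].fillna(0)

# Fill each column with its own replacement value
filled = df_missing.fillna(
    {"Age": df_missing["Age"].mean(), "City": "Unknown"}
)

print(age_filled)
print(filled)
```

Passing a dictionary lets each column use a replacement that suits its type, instead of forcing one value such as 0 onto text columns.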


### Filling Missing Values with Different Strategies

#### Using Backfill (bfill)

In [33]:
# Fill NaN values with the next valid value (backfill)
# fillna(method='backfill') is deprecated; bfill() is the current equivalent
df_backfill = df.bfill()

# Print the backfilled DataFrame
print(df_backfill)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris


In [34]:
# 'bfill' is shorthand for 'backfill'. Since fillna(method=...) is deprecated,
# the same result is obtained by calling bfill() directly
df_bfill = df.bfill()

# Print the backfilled DataFrame
print(df_bfill)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris


#### Using Pad

In [36]:
# 'pad' is an older alias for forward fill; fillna(method='pad') is deprecated,
# so ffill() is used here instead
df_pad = df.ffill()

# Print the padded DataFrame
print(df_pad)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris


#### Using Forward Fill (ffill)

In [37]:
# Fill NaN values with the previous valid value (forward fill)
# fillna(method='ffill') is deprecated; ffill() is the current equivalent
df_ffill = df.ffill()

# Print the forward filled DataFrame
print(df_ffill)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris
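
The difference between forward fill and backfill only shows up on data that has gaps. This is a minimal sketch on a made-up Series `s`; the `limit` argument caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last valid value forward.
# The leading NaN stays because nothing precedes it.
forward = s.ffill()

# Backfill: pull the next valid value backward.
# The trailing NaN stays because nothing follows it.
backward = s.bfill()

# Fill at most one consecutive NaN per gap
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```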


### Interpolation

In [39]:
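# Interpolate missing values in the numeric 'Age' column
# (calling interpolate() on object columns such as 'City' is deprecated)
df_interpolated = df.copy()
df_interpolated['Age'] = df_interpolated['Age'].interpolate()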

# Print the interpolated DataFrame
print(df_interpolated)

         Age      City
Alice     25  New York
Bob       30    London
Charlie   28     Paris
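
Since the DataFrame above has no gaps, interpolation leaves it unchanged. The sketch below shows the default linear interpolation on a made-up numeric Series with two missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# The default method is linear: missing values are spaced evenly
# between the neighbouring valid values
interpolated = s.interpolate()

print(interpolated.tolist())  # [10.0, 20.0, 30.0, 40.0]
```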


### How to drop within a DataFrame

We will look at how to drop NaN values by row, by column, based on a threshold, or based on an index.

#### Dropping based on value

In [46]:
# Creating a DataFrame with NaN values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, np.nan, 28],
        'City': ['New York', np.nan, 'Paris', 'London'],
        'Salary': [50000, 60000, 75000, np.nan]}

df = pd.DataFrame(data)
print(df)

      Name   Age      City   Salary
0    Alice  25.0  New York  50000.0
1      Bob  30.0       NaN  60000.0
2  Charlie   NaN     Paris  75000.0
3    David  28.0    London      NaN


#### Dropping by Rows

In [52]:
# Drop rows with any NaN values
df_dropna_rows = df.dropna()

# Print the DataFrame with dropped rows
print(df_dropna_rows)

    Name   Age      City   Salary
0  Alice  25.0  New York  50000.0


#### Dropping by Columns

In [50]:
# Drop columns with any NaN values
df_dropna_columns = df.dropna(axis=1)

# Print the DataFrame with dropped columns
print(df_dropna_columns)

      Name
0    Alice
1      Bob
2  Charlie
3    David


#### Drop Based on a Threshold of Non-NA Values

In [48]:
# Drop rows with fewer than 2 non-NA values
df.dropna(thresh=2)
# Every row here has at least 2 non-NA values, so no rows are dropped

      Name   Age      City   Salary
0    Alice  25.0  New York  50000.0
1      Bob  30.0       NaN  60000.0
2  Charlie   NaN     Paris  75000.0
3    David  28.0    London      NaN
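
To see `thresh` actually remove rows, the data needs a row with very few valid values. The sketch below uses a small made-up DataFrame for that purpose and also shows the `subset` argument, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Non-NA counts per row: Alice 3, Bob 1, Charlie 2
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, np.nan],
    "City": ["New York", np.nan, "Paris"],
})

# Keep rows with at least 2 non-NA values (drops Bob)
at_least_2 = df.dropna(thresh=2)

# Keep rows with at least 3 non-NA values (keeps only Alice)
at_least_3 = df.dropna(thresh=3)

# Only consider the 'City' column when looking for NaN (drops Bob)
city_known = df.dropna(subset=["City"])

print(at_least_2)
print(at_least_3)
print(city_known)
```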


#### Drop Based on Index

In [59]:
# Drop the rows at positions 1 and 3 (with the default index, their labels are also 1 and 3)
df.drop(df.index[[1, 3]])

      Name   Age      City   Salary
0    Alice  25.0  New York  50000.0
2  Charlie   NaN     Paris  75000.0
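
Rows can be dropped by label as well as by position, and `drop` works on columns too. The sketch below uses a made-up DataFrame with string labels so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"], "Age": [25, 30, 35, 28]},
    index=["a", "b", "c", "d"],
)

# Drop rows by label
by_label = df.drop(index=["b", "d"])

# Drop rows by position: look up the labels at positions 1 and 3 first
by_position = df.drop(df.index[[1, 3]])

# Drop a column
without_age = df.drop(columns=["Age"])

print(by_label)
print(without_age)
```

`drop` returns a new DataFrame each time; the original `df` is left unchanged.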
