# **Pandas**
Pandas is an open-source data manipulation and analysis library for Python. It provides data structures and functions needed to work with structured data seamlessly.


**The two primary data structures in Pandas are**:

**Series**: A one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). It is similar to a single column in a table or an Excel spreadsheet.

**DataFrame**: A two-dimensional labeled data structure with columns of potentially different types. It's similar to a table in a relational database or an Excel spreadsheet. Each column in a DataFrame is a Series.

**Pandas is commonly used for tasks such as**:
- **Data Cleaning**: Handling missing values, filtering data, and preparing it for analysis.
- **Data Exploration**: Summarizing and visualizing data to understand its structure and patterns.
- **Data Transformation**: Aggregating, merging, reshaping, and applying functions to data.
- **Data Analysis**: Performing statistical analysis and generating insights.

### Pandas Installation
In order to use Pandas, we first need to install it.

In [1]:
%pip install pandas

# Import the library
import pandas as pd



Note: you may need to restart the kernel to use updated packages.


# DataFrames
A DataFrame is a two-dimensional labeled data structure whose columns can hold different data types, similar to a spreadsheet or SQL table.
It lets you select, filter, aggregate, and reshape structured data in Python.

**Creating an Empty DataFrame**

In [2]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


**Creating a DataFrame using a list of lists**

In [8]:
data = [['Alice' , 23], ['Mike', 34],['John', 34]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df 
# print(df)

    Name  Age
0  Alice   23
1   Mike   34
2   John   34


**Creating a DataFrame using a dictionary of lists**

In [6]:
data = {
    'Name': ['Alice', 'Mike', 'John'],
    'Age' : [23,34,34]
}
df = pd.DataFrame(data)
df

    Name  Age
0  Alice   23
1   Mike   34
2   John   34
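
Since each column of a DataFrame is a Series, selecting one column by name gives back a Series that carries the column name. The short sketch below rebuilds the same small table to check this; the variable name `age_column` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Mike', 'John'],
    'Age': [23, 34, 34],
})

# Selecting a single column returns a Series, not a DataFrame
age_column = df['Age']
print(type(age_column).__name__)  # class name of the selected column
print(age_column.name)            # the Series keeps the column label

# shape reports (number of rows, number of columns)
print(df.shape)
```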


**Creating a DataFrame using a list of dictionaries**

In [9]:
data = [{'name': 'Alice', 'age': 23}, {'name': 'Mike', 'age': 34}, {'name': 'John' , 'age': 34}]
df = pd.DataFrame(data)
df

    name  age
0  Alice   23
1   Mike   34
2   John   34
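
When building a DataFrame from a list of dictionaries, the dictionaries do not all need the same keys. The sketch below uses hypothetical records in which one entry has no `age`, to show how pandas fills the gap with a missing value.

```python
import pandas as pd

# Hypothetical records: the second one has no 'age' key
records = [
    {'name': 'Alice', 'age': 23},
    {'name': 'Mike'},
    {'name': 'John', 'age': 34},
]
people_df = pd.DataFrame(records)

# The missing entry is stored as NaN, so the column becomes float64
print(people_df)
print(people_df['age'].isna().sum())  # number of missing ages
```

Note that the integer ages are shown as floats here, because NaN is a floating-point value.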


# Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). It's similar to a one-column table or an array with associated labels, providing powerful indexing and manipulation capabilities in Python.

**Creating Series**

In [10]:
s = pd.Series([1,2,3,4,5,6,7,8,9])
s

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
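
The labels of a Series do not have to be the default 0, 1, 2, and so on. As a minimal sketch, the example below attaches hypothetical name labels to a Series and looks values up by label, by position, and by condition.

```python
import pandas as pd

# Hypothetical labels chosen for illustration
ages = pd.Series([23, 34, 34], index=['Alice', 'Mike', 'John'], name='Age')

print(ages['Mike'])     # label-based lookup
print(ages.iloc[0])     # position-based lookup
print(ages[ages > 30])  # boolean selection keeps the labels
```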

### **Pandas Data Types**


Basic Data Types:
- Integer (int64): Represents whole numbers (e.g., 10, -5). This is the default integer type in pandas.
- Float (float64): Represents numbers with decimals (e.g., 3.14, -12.5).
- Boolean (bool): Represents logical True or False values.
- Object: This is a versatile but less efficient type that can store various data types like strings, lists, or custom objects. Pandas uses this type when it cannot infer a more specific data type.

In [22]:
# Integer (int64)
int_series = pd.Series([1,2,3,4,5])
int_series
# Float (float64)
float_series = pd.Series([1.23,1.25, 1.27, 1.29, 1.31])
float_series
# Boolean (bool)
bool_series = pd.Series([True, False, False, False , True])
bool_series
# Object (object)
object_series = pd.Series([True, 30.2, 'Jane', 30])
object_series

0     True
1    False
2    False
3    False
4     True
dtype: bool
0    True
1    30.2
2    Jane
3      30
dtype: object
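
To see which type pandas chose for each column, we can inspect the `dtypes` attribute, and `astype()` converts a column to another type. The sketch below uses a small made-up DataFrame (`mixed_df` and its columns are illustrative names) in which one column holds digits stored as text.

```python
import pandas as pd

mixed_df = pd.DataFrame({
    'count': [1, 2, 3],
    'price': [1.5, 2.5, 3.5],
    'in_stock': [True, False, True],
    'code': ['10', '20', '30'],  # digits stored as text
})

# One dtype per column
print(mixed_df.dtypes)

# Convert the text column to integers so it can be used in arithmetic
mixed_df['code'] = mixed_df['code'].astype('int64')
print(mixed_df['code'].dtype)
print(mixed_df['code'].sum())
```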


Specialized Data Types:
- Datetime (datetime64[ns]): Represents dates and times with nanosecond precision. Useful for time-series data analysis.
- Timedelta (timedelta64[ns]): Represents durations between timestamps.
- Categorical: Represents categorical data with predefined categories. Efficient for storing limited sets of categories.
- Sparse: Represents data in which most entries share the same fill value (such as NaN or 0). Stores it efficiently by keeping only the entries that differ from the fill value.

In [26]:
# Datetime (datetime64)
dt = pd.to_datetime('2024-07-29')
dt
# Datetime Series
datetime_series = pd.Series([pd.to_datetime('2024-08-12'), pd.to_datetime('2024-11-11')])
datetime_series

0   2024-08-12
1   2024-11-11
dtype: datetime64[ns]
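
The Timedelta and Categorical types from the list above can be sketched in the same way. The example below reuses the two dates from the previous cell; the `sizes` Series is made-up data for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2024-08-12', '2024-11-11']))

# Subtracting two timestamps yields a Timedelta
gap = dates[1] - dates[0]
print(gap)
print(gap.days)  # whole days between the two dates

# A categorical Series stores each distinct label once and refers to it by code
sizes = pd.Series(['small', 'large', 'small', 'medium'], dtype='category')
print(sizes.dtype)
print(list(sizes.cat.categories))
```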

**Sales Data Analysis**


You have a dataset of sales transactions that includes the product name, quantity sold, and sale price.
You want to analyze the data to find the total revenue per product.

In [31]:
data = {
    'Product Name':['Laptop', 'Smartphone', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Quantity Sold':[3, 2, 5, 4, 1, 2],
    'Sale Price':[1000, 800, 600, 1100, 850, 1050]
}

# Step 1: Create a DataFrame
sales_df = pd.DataFrame(data)
sales_df

# Step 2: Create a new column for Total Revenue for each row
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']
sales_df

# Step 3: Group rows by product and sum the revenue (returns a Series)
total_revenue = sales_df.groupby('Product Name')['Total Revenue'].sum()
total_revenue

# Step 4: Convert the grouped Series to a DataFrame to show it in tabular format
results_df = total_revenue.to_frame()
results_df

              Total Revenue
Product Name
Laptop                 9500
Smartphone             2450
Tablet                 3000
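
A grouped result can also carry more than one aggregate. As a sketch of one possible approach, the example below uses named aggregation in `agg()` to compute units sold and revenue per product in a single step; the output column names `units_sold` and `total_revenue` are our own choices.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product Name': ['Laptop', 'Smartphone', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [1000, 800, 600, 1100, 850, 1050],
})
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']

# Named aggregation: new_column=(source_column, aggregation)
summary = (
    sales_df.groupby('Product Name')
    .agg(units_sold=('Quantity Sold', 'sum'),
         total_revenue=('Total Revenue', 'sum'))
    .reset_index()  # turn the group labels back into a regular column
)
print(summary)
```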


**Recalling our Sales DataFrame**

In [32]:
sales_df

   Product Name  Quantity Sold  Sale Price  Total Revenue
0        Laptop              3        1000           3000
1    Smartphone              2         800           1600
2        Tablet              5         600           3000
3        Laptop              4        1100           4400
4    Smartphone              1         850            850
5        Laptop              2        1050           2100


### **Data Selection**

pandas provides numerous methods for selecting and indexing data in Series and DataFrames, including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

#### Data Selection in Series

**Check the First Two Rows of Product Names**

In [33]:
sales_df['Product Name'][0:2]

0        Laptop
1    Smartphone
Name: Product Name, dtype: object

**Check the First Two Rows of Quantity Sold**

In [34]:
sales_df['Quantity Sold'][0:2]

0    3
1    2
Name: Quantity Sold, dtype: int64

**Sum method for totaling**

In [35]:
sales_df['Quantity Sold'][0:2].sum()

np.int64(5)

**Integer Location (.iloc)**

In [37]:
sales_df.iloc[0:2]

   Product Name  Quantity Sold  Sale Price  Total Revenue
0        Laptop              3        1000           3000
1    Smartphone              2         800           1600


In [41]:
"""
Location (.loc)
- Access a group of rows and columns by label(s) or a boolean array.
"""

# Get rows labeled 0 through 3 (the end label is included) and only specific columns
sales_df.loc[0:3, ['Product Name', 'Sale Price']]

   Product Name  Sale Price
0        Laptop        1000
1    Smartphone         800
2        Tablet         600
3        Laptop        1100
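
One difference between the two indexers is easy to miss: a slice with `.iloc` excludes its end position, while a slice with `.loc` includes its end label. The sketch below shows this on a small copy of the price data; the letter labels in `relabeled` are hypothetical and only there to make labels and positions differ.

```python
import pandas as pd

prices = pd.DataFrame({
    'Product Name': ['Laptop', 'Smartphone', 'Tablet', 'Laptop'],
    'Sale Price': [1000, 800, 600, 1100],
})

by_position = prices.iloc[0:2]  # positions 0 and 1 (end excluded)
by_label = prices.loc[0:2]      # labels 0, 1 and 2 (end included)
print(len(by_position), len(by_label))

# With non-integer labels, .loc uses the labels and .iloc the positions
relabeled = prices.copy()
relabeled.index = ['a', 'b', 'c', 'd']
print(relabeled.loc['b':'c'])
print(relabeled.iloc[1:3])
```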


# Conditional Filtering

**Checking for Total Revenues greater than or equal to 2000**

In [45]:
sales_df[sales_df['Total Revenue'] >= 2000]

   Product Name  Quantity Sold  Sale Price  Total Revenue
0        Laptop              3        1000           3000
2        Tablet              5         600           3000
3        Laptop              4        1100           4400
5        Laptop              2        1050           2100


**Checking for Product Names equal to `'Laptop'`**

In [46]:
sales_df[sales_df['Product Name'] == 'Laptop']

   Product Name  Quantity Sold  Sale Price  Total Revenue
0        Laptop              3        1000           3000
3        Laptop              4        1100           4400
5        Laptop              2        1050           2100
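
The two kinds of condition shown above can be combined. In this sketch, `&` means "and" and `|` means "or"; each condition needs its own parentheses because these operators bind more tightly than the comparisons. The thresholds are chosen only for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product Name': ['Laptop', 'Smartphone', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [1000, 800, 600, 1100, 850, 1050],
})
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']

# Laptop sales with revenue of at least 3000 (both conditions must hold)
is_laptop = sales_df['Product Name'] == 'Laptop'
is_large = sales_df['Total Revenue'] >= 3000
big_laptop_sales = sales_df[is_laptop & is_large]
print(big_laptop_sales)

# Tablet sales or single-unit sales (either condition may hold)
either = sales_df[(sales_df['Product Name'] == 'Tablet') | (sales_df['Quantity Sold'] == 1)]
print(either)
```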


**Filtering Customer Reviews**

A DataFrame contains customer reviews for different products, including a numeric rating. You need to filter the reviews to find every one with a rating of 4 or higher.

In [47]:
reviews_data = {
    'ProductID': [
        'Coffee Maker', 'Blender', 'Vacuum Cleaner', 'Air Fryer', 'Microwave',
        'Dishwasher', 'Toaster', 'Refrigerator', 'Oven', 'Washing Machine',
        'Slow Cooker', 'Mixer', 'Electric Kettle', 'Juicer', 'Rice Cooker',
        'Food Processor', 'Grill', 'Bread Maker', 'Espresso Machine', 'Dehumidifier'
    ],
    'Rating': [5, 3, 2, 1, 4, 3, 2, 4, 5, 1, 4, 5, 4, 3, 5, 4, 5, 4, 5, 2]
}

reviews_df = pd.DataFrame(reviews_data)
print(reviews_df)


           ProductID  Rating
0       Coffee Maker       5
1            Blender       3
2     Vacuum Cleaner       2
3          Air Fryer       1
4          Microwave       4
5         Dishwasher       3
6            Toaster       2
7       Refrigerator       4
8               Oven       5
9    Washing Machine       1
10       Slow Cooker       4
11             Mixer       5
12   Electric Kettle       4
13            Juicer       3
14       Rice Cooker       5
15    Food Processor       4
16             Grill       5
17       Bread Maker       4
18  Espresso Machine       5
19      Dehumidifier       2


**Finding Reviews with a Rating of 4 and Above**

In [48]:
reviews_df[reviews_df['Rating'] >= 4]

           ProductID  Rating
0       Coffee Maker       5
4          Microwave       4
7       Refrigerator       4
8               Oven       5
10       Slow Cooker       4
11             Mixer       5
12   Electric Kettle       4
14       Rice Cooker       5
15    Food Processor       4
16             Grill       5
17       Bread Maker       4
18  Espresso Machine       5
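
To narrow the result to particular products as well, the rating condition can be paired with a product condition. The sketch below uses a small hypothetical review table in which products appear more than once, and `isin()` to match against a list of product names.

```python
import pandas as pd

# Hypothetical reviews: products repeat, unlike in the table above
reviews = pd.DataFrame({
    'ProductID': ['Coffee Maker', 'Blender', 'Coffee Maker', 'Toaster', 'Blender'],
    'Rating': [5, 3, 2, 4, 4],
})

high_rating = reviews['Rating'] >= 4

# Reviews of one specific product with a rating of 4 or higher
coffee_maker_top = reviews[(reviews['ProductID'] == 'Coffee Maker') & high_rating]
print(coffee_maker_top)

# Same idea for several products at once
wanted = ['Coffee Maker', 'Blender']
several_top = reviews[reviews['ProductID'].isin(wanted) & high_rating]
print(several_top)
```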


## Pandas Inspection Methods

**Previewing Data with `head()`**

The head() function returns the first few rows of a DataFrame. By default, it returns the first five rows, but you can request a different number by passing it as an argument. It is useful for quickly inspecting the data and understanding its structure, especially when working with large datasets.

In [49]:
reviews_df.head()

        ProductID  Rating
0    Coffee Maker       5
1         Blender       3
2  Vacuum Cleaner       2
3       Air Fryer       1
4       Microwave       4


**Inspecting the End with `tail()`**

The tail() function works similarly to head(), but it displays the last few rows of the DataFrame. Like head(), it shows five rows by default, but you can specify a different number. This is helpful when you want to inspect the end of your dataset, such as the most recent entries in a time series.

In [50]:
reviews_df.tail(2)

           ProductID  Rating
18  Espresso Machine       5
19      Dehumidifier       2


**Summarizing Your Data with `describe()`**

The describe() function generates summary statistics of numeric columns in the DataFrame. It provides key metrics such as count, mean, standard deviation, minimum and maximum values, and the quartiles (25th, 50th, and 75th percentiles). This function is invaluable for quickly getting an overview of the distribution and central tendencies of your data.

In [58]:
reviews_df.describe()

          Rating
count  20.000000
mean    3.550000
std     1.356272
min     1.000000
25%     2.750000
50%     4.000000
75%     5.000000
max     5.000000
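
By default `describe()` skips text columns such as `ProductID`. As a sketch, the example below calls it on a text column directly, which reports the count, the number of unique values, the most frequent value, and its frequency, and then uses `include='all'` to summarize every column together. The small table is made up for illustration.

```python
import pandas as pd

# Hypothetical reviews with one repeated product
reviews = pd.DataFrame({
    'ProductID': ['Blender', 'Blender', 'Toaster', 'Grill'],
    'Rating': [3, 4, 2, 5],
})

# Summary of a text column: count, unique, top, freq
text_summary = reviews['ProductID'].describe()
print(text_summary)

# Summary of all columns; statistics that do not apply show as NaN
full_summary = reviews.describe(include='all')
print(full_summary)
```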


**Understanding Data Structure with `info()`**

The info() function gives a concise summary of the DataFrame, including the number of non-null entries in each column, the data type of each column, and the memory usage of the DataFrame. This is particularly useful for understanding the structure of your data, identifying missing values, and optimizing memory usage.

In [51]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ProductID  20 non-null     object
 1   Rating     20 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 452.0+ bytes
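
The same information is available as values we can work with in code. The sketch below, on a small made-up table with gaps, counts missing entries per column with `isna().sum()` and measures memory per column with `memory_usage(deep=True)`, where `deep=True` also counts the contents of text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
reviews = pd.DataFrame({
    'ProductID': ['Blender', 'Toaster', None, 'Grill'],
    'Rating': [3.0, np.nan, 2.0, 5.0],
})

# Number of missing entries in each column
missing_per_column = reviews.isna().sum()
print(missing_per_column)

# Memory used by the index and by each column, in bytes
memory_bytes = reviews.memory_usage(deep=True)
print(memory_bytes)
```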


## Data Analysis
**Calculating Averages with `mean()`**

The mean() function calculates the average value of a numeric Series or the mean of each numeric column in a DataFrame. It is often used to determine the central tendency of the data, providing insight into the typical value in a dataset.

In [52]:
print(reviews_df['Rating'].mean())

3.55


**Finding the Middle Value with `median()`**

The median() function computes the median value, which is the middle value in a sorted list of numbers. Unlike the mean, the median is less affected by outliers and skewed data, making it a robust measure of central tendency.

In [53]:
print(reviews_df['Rating'].median())

4.0
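
To see why the median is considered robust, we can compare both measures on data with and without an extreme value. The ratings below are hypothetical, with the last entry of the second Series standing in for a data-entry error.

```python
import pandas as pd

ratings = pd.Series([4, 4, 5, 4, 5])
with_outlier = pd.Series([4, 4, 5, 4, 50])  # hypothetical data-entry error

# The mean shifts noticeably when the outlier is present
print(ratings.mean(), with_outlier.mean())

# The median stays at the middle value of the sorted data
print(ratings.median(), with_outlier.median())
```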


**Measuring Data Spread with `std()`**

The std() function calculates the standard deviation, a measure of how widely values are spread around the mean. A higher standard deviation indicates that the data points lie farther from the mean, while a lower one suggests that they cluster closer to it. By default, pandas computes the sample standard deviation, dividing by n - 1 rather than n.

In [54]:
print(reviews_df['Rating'].std())

1.3562719801759993


**Assessing Variability with `var()`**

The var() function computes the variance, which is the square of the standard deviation. Variance measures the extent to which the data points deviate from the mean. Like the standard deviation, a higher variance indicates more variability in the data.

In [57]:
print(reviews_df['Rating'].var())

1.8394736842105261
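
The relationship between the two measures can be checked directly. In this sketch on a small made-up Series, the `ddof` parameter of `var()` switches between the sample variance (`ddof=1`, the pandas default) and the population variance (`ddof=0`).

```python
import pandas as pd

ratings = pd.Series([5, 3, 2, 1, 4])

sample_std = ratings.std()            # default ddof=1 (divide by n - 1)
sample_var = ratings.var()            # default ddof=1
population_var = ratings.var(ddof=0)  # divide by n instead

# The variance equals the squared standard deviation
print(sample_std ** 2, sample_var)
print(population_var)
```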
