# **Pandas**

### **Learning Objectives**
- Learn about Pandas' two main data structures, Series and DataFrame, and how to perform basic operations like selecting, filtering, and sorting data.
- Gain skills in using pandas for basic exploratory data analysis, including summarizing data, calculating descriptive statistics, and visualizing data distributions.
- Acquire techniques for handling missing data, removing duplicates, and transforming data using pandas' built-in functions and methods.

## Warm Up Exercise

**Activity**: SQL importing
**Instructions**: Import the SQLite Database we created previously.
1. Import the `sqlite3` module into the notebook.
2. Connect to the database.
3. Do a select statement to check if our data is still accessible.

In [1]:
# Activity Space

### Pandas Installation
In order to use Pandas, we first need to install it.

```pip install pandas```

In [2]:
import pandas as pd

# DataFrames
A DataFrame is a two-dimensional labeled data structure whose columns can hold different data types, similar to a spreadsheet or SQL table.
It is the main pandas object for manipulating and analyzing structured data in Python.

In [3]:
# Creating an Empty DataFrame

df = pd.DataFrame()
df

In [4]:
# Creating a DataFrame using a list of lists

data = [['Alice', 23], ['Mike', 34], ['Jetson', 28]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df

|   | Name | Age |
|---|------|-----|
| 0 | Alice | 23 |
| 1 | Mike | 34 |
| 2 | Jetson | 28 |


In [5]:
# Creating a DataFrame using a dictionary of lists

data = {
    'Name': ['Alice', 'Mike', 'Jetson'], 
    'Age': [23, 34, 28]
}

df = pd.DataFrame(data)
df

|   | Name | Age |
|---|------|-----|
| 0 | Alice | 23 |
| 1 | Mike | 34 |
| 2 | Jetson | 28 |


In [6]:
# Creating a DataFrame using a list of dictionaries

data =  [{'name': 'Alice', 'age': 23}, {'name': 'Mike', 'age': 34}, {'name': 'Jetson', 'age': 28}]

df = pd.DataFrame(data)
df

|   | name | age |
|---|------|-----|
| 0 | Alice | 23 |
| 1 | Mike | 34 |
| 2 | Jetson | 28 |


# Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). It's similar to a one-column table or an array with associated labels, and it supports the same indexing and manipulation tools as a DataFrame column.

In [7]:
s = pd.Series([1,3,5,6,7,8])
s

0    1
1    3
2    5
3    6
4    7
5    8
dtype: int64
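
The labels of a Series do not have to be the default integers. The following is a minimal sketch, assuming made-up temperature readings and city labels that are not part of the lesson data, showing label-based access next to position-based access.

```python
import pandas as pd

# Hypothetical data: temperatures in Celsius, labeled by city name.
temps = pd.Series([31.5, 28.0, 25.2], index=['Manila', 'Cebu', 'Baguio'])

# Label-based access uses the index label rather than the position.
cebu_temp = temps['Cebu']

# Position-based access is still available through .iloc.
first_temp = temps.iloc[0]

# Arithmetic is applied element-wise and keeps the labels.
temps_f = temps * 9 / 5 + 32

print(cebu_temp, first_temp)
print(temps_f)
```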

In [8]:
# Creating a range of dates (this returns a DatetimeIndex, which can be used as the index of a Series)
dates = pd.date_range('20240101',periods=6)
dates

DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-06'],
              dtype='datetime64[ns]', freq='D')
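
A date range becomes more useful once it labels some values. The sketch below, assuming hypothetical daily visitor counts, uses the dates as the index of a Series and then selects by date.

```python
import pandas as pd

dates = pd.date_range('20240101', periods=6)

# Hypothetical daily visitor counts, one value per date.
visitors = pd.Series([12, 15, 9, 20, 18, 11], index=dates)

# A date string can be used as a label with .loc.
jan_3 = visitors.loc['2024-01-03']

# Label slices on a DatetimeIndex include both endpoints.
first_three_days = visitors.loc['2024-01-01':'2024-01-03']

print(jan_3)
print(first_three_days)
```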

**Example: Sales Data Analysis**

You have a dataset of sales transactions that includes the product name, quantity sold, and sale price. 
You want to analyze the data to find the total revenue per product.

In [9]:
data = {
    'Product Name':['A','B','C','A','B','A'],
    'Quantity Sold':[3,2,5,4,1,2],
    'Sale Price':[10,20,10,15,20,15]
}

sales_df = pd.DataFrame(data)
sales_df

Unnamed: 0,Product Name,Quantity Sold,Sale Price
0,A,3,10
1,B,2,20
2,C,5,10
3,A,4,15
4,B,1,20
5,A,2,15


In [10]:
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']
sales_df

# Group by product and sum the revenue column (not the quantity column)
total_revenue = sales_df.groupby('Product Name')['Total Revenue'].sum()
total_revenue

Product Name
A    120
B     60
C     50
Name: Total Revenue, dtype: int64
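
When more than one summary per product is useful, `groupby` can be combined with `agg`. This is a minimal sketch on the same sales data; the output column names `units_sold` and `total_revenue` are illustrative choices, not names used elsewhere in this lesson.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [10, 20, 10, 15, 20, 15],
})
sales_df['Total Revenue'] = sales_df['Quantity Sold'] * sales_df['Sale Price']

# Named aggregation: each keyword becomes a column in the result,
# built from a (source column, aggregation function) pair.
summary = sales_df.groupby('Product Name').agg(
    units_sold=('Quantity Sold', 'sum'),
    total_revenue=('Total Revenue', 'sum'),
)

print(summary)
```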

### **Data Selection**

pandas provides numerous methods for selecting and indexing data in Series and DataFrames, including label-based indexing with .loc, integer-position based indexing with .iloc, and conditional selection.

In [11]:
# Recalling our Sales DataFrame
sales_df

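|   | Product Name | Quantity Sold | Sale Price | Total Revenue |
|---|--------------|---------------|------------|---------------|
| 0 | A | 3 | 10 | 30 |
| 1 | B | 2 | 20 | 40 |
| 2 | C | 5 | 10 | 50 |
| 3 | A | 4 | 15 | 60 |
| 4 | B | 1 | 20 | 20 |
| 5 | A | 2 | 15 | 30 |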


In [12]:
# Check the First Two Rows of Product Names
sales_df['Product Name'][0:2]

0    A
1    B
Name: Product Name, dtype: object

In [13]:
# Check the First Two Rows of Quantity Sold
sales_df['Quantity Sold'][0:2]

0    3
1    2
Name: Quantity Sold, dtype: int64

In [14]:
# Sum method for totaling
sales_df['Quantity Sold'][0:2].sum()

5

In [15]:
"""
Index Location (.iloc)
- Selects rows (and, optionally, columns) by integer position.
- A slice returns a DataFrame; a single integer returns that row as a Series.
""" 

sales_df.iloc[0:3]

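|   | Product Name | Quantity Sold | Sale Price | Total Revenue |
|---|--------------|---------------|------------|---------------|
| 0 | A | 3 | 10 | 30 |
| 1 | B | 2 | 20 | 40 |
| 2 | C | 5 | 10 | 50 |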


In [16]:
"""
Location (.loc)
- Access a group of rows and columns by label(s) or a boolean array.
- Label slices include both endpoints, so 0:3 returns the rows labeled 0, 1, 2, and 3.
"""

# Get only specific columns
sales_df.loc[0:3, ['Product Name', 'Sale Price']]

|   | Product Name | Sale Price |
|---|--------------|------------|
| 0 | A | 10 |
| 1 | B | 20 |
| 2 | C | 10 |
| 3 | A | 15 |
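
The two indexers treat the same slice differently, which is easy to miss when the index labels happen to be 0, 1, 2, and so on. A minimal sketch comparing them on the sales data:

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Quantity Sold': [3, 2, 5, 4, 1, 2],
    'Sale Price': [10, 20, 10, 15, 20, 15],
})

# .iloc slices by position and excludes the end: positions 0, 1, 2.
by_position = sales_df.iloc[0:3]

# .loc slices by label and includes the end: labels 0, 1, 2, 3.
by_label = sales_df.loc[0:3]

# A single integer with .iloc returns one row as a Series.
first_row = sales_df.iloc[0]

print(len(by_position), len(by_label))
print(type(first_row).__name__)
```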


In [17]:
# Conditional Filtering

# Checking for Total Revenues greater than or equal to 40
sales_df[sales_df['Total Revenue'] >= 40]

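|   | Product Name | Quantity Sold | Sale Price | Total Revenue |
|---|--------------|---------------|------------|---------------|
| 1 | B | 2 | 20 | 40 |
| 2 | C | 5 | 10 | 50 |
| 3 | A | 4 | 15 | 60 |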


In [18]:
# Checking for Product Names equal to A
sales_df[sales_df['Product Name'] == "A"]

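|   | Product Name | Quantity Sold | Sale Price | Total Revenue |
|---|--------------|---------------|------------|---------------|
| 0 | A | 3 | 10 | 30 |
| 3 | A | 4 | 15 | 60 |
| 5 | A | 2 | 15 | 30 |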


**Example: Filtering Customer Reviews**

A DataFrame contains customer reviews for different products, including a numeric rating. You need to find all reviews with a rating of 4 or higher; the same approach extends to narrowing the results to a specific product.

In [19]:
reviews_data = {
    'ProductID': ['P1','P2','P3','P4','P5'],
    'Rating': [5,3,2,1,4]
}

reviews_df = pd.DataFrame(reviews_data)

In [20]:
reviews_df[reviews_df['Rating']>=4]

|   | ProductID | Rating |
|---|-----------|--------|
| 0 | P1 | 5 |
| 4 | P5 | 4 |
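
To narrow the high-rated reviews to a single product, two conditions can be combined with `&`. This is a minimal sketch; the choice of `'P5'` as the product of interest is arbitrary and only for illustration.

```python
import pandas as pd

reviews_df = pd.DataFrame({
    'ProductID': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'Rating': [5, 3, 2, 1, 4],
})

# Each comparison produces a boolean Series, one value per row.
is_target_product = reviews_df['ProductID'] == 'P5'  # arbitrary example product
is_high_rating = reviews_df['Rating'] >= 4

# & keeps only the rows where both conditions are True.
p5_high = reviews_df[is_target_product & is_high_rating]

print(p5_high)
```

When the conditions are written inline, each one needs its own parentheses, for example `reviews_df[(reviews_df['ProductID'] == 'P5') & (reviews_df['Rating'] >= 4)]`, and `|` is used for "or".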


### **Importing and Exporting Data**

Pandas supports reading from and writing to a variety of file formats, including CSV, Excel, SQL, and JSON, making it easy to integrate with data analysis workflows.
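
As a small illustration of the SQL case, the sketch below writes a DataFrame to a SQLite table and reads part of it back with a query. It assumes an in-memory database and a table name, `people`, chosen only for this example, so it does not touch the database from the warm-up exercise.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Mike', 'Jetson'], 'Age': [23, 34, 28]})

# An in-memory database keeps the sketch self-contained (assumption:
# a real notebook would connect to a database file instead).
conn = sqlite3.connect(':memory:')

# Write the DataFrame to a table; index=False skips the row labels.
df.to_sql('people', conn, index=False)

# Read the result of a SELECT statement back into a DataFrame.
older = pd.read_sql_query(
    'SELECT Name, Age FROM people WHERE Age > 25 ORDER BY Age', conn
)
conn.close()

print(older)
```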

**Example: Reading a CSV File and Exporting It to Excel**

Assume you have a CSV file, example.csv, in a data folder. You need to load it into a DataFrame and export it to an Excel file.

Tip: Install **openpyxl** for Excel exporting!

```pip install openpyxl```

In [21]:
# Turns CSV into a DataFrame
data = pd.read_csv('./data/example.csv')
data

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 10.0 |
| 1 | 2.0 | 6.5 | 11.0 |
| 2 | 2.333333 | 6.5 | 12.0 |
| 3 | 4.0 | 8.0 | 11.0 |


In [22]:
# Exports the DataFrame to an Excel file.
data.to_excel('./exports/exported_data.xlsx', sheet_name='Sheet1', index=False)

**Example: Importing Sales Data and Exporting Analysis**

Assume you have a CSV file, sales.csv, containing sales data. You need to import this data, calculate the total sales, and export the summary to a new file.

In [23]:
# Importing Sales into the notebook
sales_df = pd.read_csv('./data/sales.csv')
sales_df

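|   | Product Name | Quantity Sold | Sale Price |
|---|--------------|---------------|------------|
| 0 | Product A | 3 | 10 |
| 1 | Product B | 2 | 20 |
| 2 | Product A | 5 | 10 |
| 3 | Product C | 4 | 15 |
| 4 | Product B | 1 | 20 |
| 5 | Product C | 2 | 15 |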


In [24]:
# Getting the total revenue of the store

# Multiplying the price with how much was sold
sales_df['Total'] = sales_df['Quantity Sold'] * sales_df['Sale Price']

# Adding all totals together
total_revenue = sales_df['Total'].sum()

total_revenue

230

In [25]:
# Create a DataFrame for Total Sales
summary_df = pd.DataFrame([{'Total Sales': total_revenue}])

# Export to CSV file
summary_df.to_csv('./exports/sales_summary.csv',index=False)

### **Working with Text Files**

Besides structured data formats like CSV and Excel, pandas can also work efficiently with plain text files, allowing for text data manipulation and preprocessing.

In [26]:
# Arguments: the file path, then the cell separator (e.g. '\t' for TAB)
df_txt = pd.read_csv('./data/example.txt',sep='\t')
df_txt

|   | ProductID | ProductName | Category | Price | UnitsSold |
|---|-----------|-------------|----------|-------|-----------|
| 0 | 1 | Coffee Maker | Appliance | 99.99 | 150 |
| 1 | 2 | Tea Kettle | Kitchen | 24.99 | 200 |
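
To try the `sep` argument without a file on disk, the same tab-separated text can be passed to `read_csv` through `io.StringIO`. This is a minimal sketch; the inline string stands in for the contents of a text file and is an assumption based on the output shown above.

```python
import io

import pandas as pd

# Tab-separated text standing in for the contents of a .txt file.
raw_text = (
    "ProductID\tProductName\tCategory\tPrice\tUnitsSold\n"
    "1\tCoffee Maker\tAppliance\t99.99\t150\n"
    "2\tTea Kettle\tKitchen\t24.99\t200\n"
)

# read_csv accepts any file-like object, not only a path.
df_inline = pd.read_csv(io.StringIO(raw_text), sep='\t')

print(df_inline)
print(df_inline.dtypes)
```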


**Example: Analyzing Text Data**

Suppose you have a text file, reviews.txt, containing tab-separated values of product reviews (ProductID, ReviewText). You aim to add a column indicating the review length and save the enriched data.

In [27]:
# Import the reviews.txt data and convert it to a DataFrame
df_txt = pd.read_csv('./data/reviews.txt',sep='\t')
df_txt

|   | ProductID | ReviewText |
|---|-----------|------------|
| 0 | P1 | This product is great! |
| 1 | P2 | Could be better. |
| 2 | P1 | Loved it! |
| 3 | P3 | Not what I expected. |
| 4 | P2 | Amazing product. |


In [28]:
# Creating a new column called Review Length by getting the length of review texts

# .apply calls a function on each value in a column; here the function is len
df_txt['Review Length'] = df_txt['ReviewText'].apply(len)

In [29]:
df_txt

|   | ProductID | ReviewText | Review Length |
|---|-----------|------------|---------------|
| 0 | P1 | This product is great! | 22 |
| 1 | P2 | Could be better. | 16 |
| 2 | P1 | Loved it! | 9 |
| 3 | P3 | Not what I expected. | 20 |
| 4 | P2 | Amazing product. | 16 |
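
The same lengths can also be computed with the string accessor, `.str.len()`. The sketch below rebuilds the review data inline (an assumption, so that it runs without reviews.txt) and compares the two approaches, including how `.str.len()` treats a missing value.

```python
import pandas as pd

reviews = pd.DataFrame({
    'ProductID': ['P1', 'P2', 'P1', 'P3', 'P2'],
    'ReviewText': [
        'This product is great!',
        'Could be better.',
        'Loved it!',
        'Not what I expected.',
        'Amazing product.',
    ],
})

# Both approaches give the same result on complete text data.
with_apply = reviews['ReviewText'].apply(len)
with_str = reviews['ReviewText'].str.len()

# With a missing value, .str.len() yields a missing result for that row,
# whereas apply(len) would raise an error because len() does not accept it.
with_missing = pd.Series(['ok', None]).str.len()

print(with_str.tolist())
print(with_missing)
```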


In [30]:
# Exporting to a tab-separated text file
df_txt.to_csv('./exports/review_lengths.txt',sep='\t',index=False)