# Pandas 101

A comprehensive guide to getting started with Pandas for data analysis and manipulation.

In [None]:
import pandas as pd

## Basics

### DataFrames and Series

A DataFrame is a two-dimensional, labeled table of data with rows and columns. It's the primary data structure in Pandas.

In [None]:
df = pd.DataFrame({
    'name':['john','mary','peter','jeff','bill','lisa'],
    'age':[23,78,22,19,45,33],
    'sex': ['male','female','male','male','male','female']
})
df

Each column in a DataFrame is a Series: essentially a 1D array whose values carry index labels. A Series can also be created on its own.

In [None]:
ages = pd.Series([22,31,43,44,55])
ages
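
As a minimal sketch of the relationship between the two structures, the cell below selects a column from a small DataFrame and builds a Series with custom labels. The `people` and `ages_by_name` names and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, mirroring the small DataFrame above.
people = pd.DataFrame({
    'name': ['john', 'mary', 'peter'],
    'age': [23, 78, 22],
})

# Selecting a single column returns a Series.
age_column = people['age']
print(type(age_column).__name__)

# A Series can carry its own labels instead of the default 0, 1, 2, ...
ages_by_name = pd.Series([23, 78, 22], index=['john', 'mary', 'peter'])
print(ages_by_name['mary'])
```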

### Loading Data

We load data using the `read_*` functions, where the asterisk stands for the file format. For example, `pd.read_csv(filename)` loads a CSV file.

In [None]:
df = pd.read_csv('purchases.csv')
df
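
If `purchases.csv` is not at hand, the same call can be tried on in-memory text. The sketch below assumes a tiny CSV with `product`, `price`, and `quantity` columns; the real file may look different.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real purchases.csv may have other columns.
csv_text = """product,price,quantity
Laptop,1200,1
Mouse,25,3
Keyboard,75,2
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# usecols limits which columns are loaded.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['product', 'price'])
print(subset.columns.tolist())
```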

### Inspecting Data

We can inspect the data using several built-in DataFrame methods:
- `.head()` - displays the first 5 rows by default
- `.tail()` - displays the last 5 rows by default
- `.info()` - shows information about the DataFrame including column data types and non-null values
- `.describe()` - provides descriptive statistics about numerical columns
- `.shape` - returns the dimensions (rows, columns)
- `.columns` - lists all column names

In [None]:
print("Output of df.head():")
print(df.head())
print("\nOutput of df.tail():")
print(df.tail())
print("\nOutput of df.info():")
df.info()
print("\nOutput of df.describe():")
print(df.describe())
print("\nOutput of df.shape:")
print(df.shape)
print("\nOutput of df.columns:")
print(df.columns)

## Selecting and Filtering

### Selecting Columns

We can use column names to access columns using bracket notation: `df['column_name']`

In [None]:
df['product']

### Using iloc and loc

We can select or filter rows using `iloc` or `loc`:
- `iloc` is integer-location based indexing - it selects rows and columns based on their integer position
- `loc` is label-based indexing - it selects rows and columns based on their labels

In [None]:
print("Selecting the first row and first column using iloc:")
display(df.iloc[0,0])

print("\nSelecting the first three rows using iloc:")
display(df.iloc[0:3])

print("\nSelecting the rows at positions 1, 3, and 5 using iloc:")
display(df.iloc[[1, 3, 5]])

To filter rows by a condition and select columns by name in one step, we use `loc`. For example, displaying the product name and quantity for products priced above 100:

In [None]:
display(df.loc[df['price']>100,['product','quantity']])
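
The difference between position and label is easiest to see when the index is not the default 0, 1, 2. The sketch below uses a made-up `orders` DataFrame with index labels 10, 20, 30. It also shows that slices behave differently: `iloc` excludes the end position, while `loc` includes the end label.

```python
import pandas as pd

# Hypothetical purchases with a non-default index.
orders = pd.DataFrame(
    {'product': ['Laptop', 'Mouse', 'Keyboard'], 'price': [1200, 25, 75]},
    index=[10, 20, 30],
)

first_by_position = orders.iloc[0]  # row at position 0
first_by_label = orders.loc[10]     # row whose index label is 10

# With the default index, the same slice returns a different number of rows.
plain = orders.reset_index(drop=True)
by_position = plain.iloc[0:2]       # positions 0 and 1
by_label = plain.loc[0:2]           # labels 0, 1 and 2
print(len(by_position), len(by_label))
```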

### Conditional Filtering

We can filter records using conditions. For example, finding products priced above 1000:

In [None]:
print(df[df['price'] > 1000])

We can combine multiple conditions using `&` (and) or `|` (or), wrapping each condition in parentheses. Let's filter purchases where the product price is above 100 AND quantity is more than 1:

In [None]:
display(df[(df['price']>100)&(df['quantity']>1)])
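
A few more ways to build a filter, sketched on a made-up `orders` DataFrame: `|` for OR, `~` to negate a condition, and `isin()` to match against a list of values.

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'quantity': [1, 3, 2, 2],
})

# OR: either condition may hold. Each comparison needs its own parentheses
# because & and | bind more tightly than > and <.
cheap_or_bulk = orders[(orders['price'] < 50) | (orders['quantity'] > 1)]

# NOT: ~ inverts a boolean mask.
not_expensive = orders[~(orders['price'] > 100)]

# isin tests membership in a list of values.
peripherals = orders[orders['product'].isin(['Mouse', 'Keyboard'])]
print(peripherals)
```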

### Sorting

We can sort values using the `.sort_values()` method. Let's sort products by highest quantity purchased:

In [None]:
display(df.sort_values('quantity',ascending=False))
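
`sort_values()` also accepts a list of columns, which helps break ties. The sketch below uses a made-up `orders` DataFrame and sorts by quantity first, then by price. Note that the method returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'quantity': [1, 3, 2, 2],
})

# Sort by quantity (descending); ties are ordered by price (descending).
ranked = orders.sort_values(['quantity', 'price'], ascending=[False, False])
print(ranked)
```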

## Data Cleaning and Transformation

### Handling Missing Values

We can handle missing values using various methods:
- `isna()` - finds missing values
- `fillna()` - imputes (fills) missing values
- `dropna()` - deletes records with missing values

In [None]:
newdf = pd.read_csv('purchases 2.csv')
newdf.info()
display(newdf)

Let's use `isna()` to find the null values:

In [None]:
newdf.isna().sum()

We can choose to impute the missing values with statistics like mean, median, or use forward fill or backward fill strategies.

Filling the null price values with the mean:

In [None]:
newdf['price'] = newdf['price'].fillna(newdf['price'].mean())
newdf.info()

Filling the null quantity values with the median:

In [None]:
newdf['quantity'] = newdf['quantity'].fillna(newdf['quantity'].median())
newdf.info()

Filling the missing product values with the forward fill strategy:

In [None]:
# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
newdf['product'] = newdf['product'].ffill()
newdf.info()
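
The remaining strategies mentioned above can be compared side by side. This is a minimal sketch on a made-up `gaps` DataFrame, not the contents of `purchases 2.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
gaps = pd.DataFrame({
    'product': ['Laptop', None, 'Mouse', None],
    'price': [1200.0, np.nan, 25.0, 30.0],
})

# Forward fill copies the last known value downward.
forward = gaps['product'].ffill()

# Backward fill copies the next known value upward; a trailing gap stays empty.
backward = gaps['product'].bfill()

# dropna removes rows with any missing value, or only checks chosen columns.
complete_rows = gaps.dropna()
priced_rows = gaps.dropna(subset=['price'])
print(len(complete_rows), len(priced_rows))
```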

### Applying Functions

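We can apply functions to columns using the `apply()` and `map()` methods:
- `apply()` on a DataFrame applies a function axis-wise (to each column or each row); on a single column (a Series) it applies the function to each value
- `map()` applies a function element-wise to a Series, and also accepts a dictionary for value lookups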

A lambda is a small anonymous function in Python. It is useful when we need a short function without formally defining it with `def`.

Let's double all quantities in the DataFrame:

In [None]:
newdf['quantity'] = newdf['quantity'].apply(lambda x: x*2)
newdf

Here's a simple example of a lambda function:

In [None]:
double = lambda x: x*2
print(double(5))

Let's add 500 to the price of products above 1000 in price using `map()`:

In [None]:
newdf['price'] = newdf['price'].map(lambda x: x+500 if x>1000 else x)
newdf
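
The distinction between the two methods shows up most clearly on a whole DataFrame. The sketch below, on a made-up `orders` DataFrame, applies a function per column and per row, uses `map()` with a dictionary, and ends with plain arithmetic, which is usually the simpler choice when an operator already does the job.

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({
    'price': [1200.0, 25.0, 75.0],
    'quantity': [1, 3, 2],
})

# axis=0 (the default) passes each column to the function.
column_totals = orders.apply(lambda col: col.sum())

# axis=1 passes each row to the function.
row_totals = orders.apply(lambda row: row['price'] * row['quantity'], axis=1)

# Series.map with a dict looks each value up in the dict.
sizes = pd.Series(['S', 'M', 'L']).map({'S': 'small', 'M': 'medium', 'L': 'large'})

# Arithmetic on a Series is already element-wise, no apply needed.
doubled = orders['quantity'] * 2
print(row_totals)
```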

### String Operations

We can operate on textual data using the `.str` accessor. For example, let's check the length of each product name:

In [None]:
newdf['product'].str.len()

Other useful string functions using the `.str` accessor:

In [None]:
# Convert product names to uppercase
print("\nProduct names in uppercase:")
display(newdf['product'].str.upper().head())

# Convert product names to lowercase
print("\nProduct names in lowercase:")
display(newdf['product'].str.lower().head())

# Check if product name contains 'Mouse'
print("\nProducts containing 'Mouse':")
display(newdf[newdf['product'].str.contains('Mouse', na=False)])

# Replace 'Laptop' with 'Gaming Laptop'
print("\nReplacing 'Laptop' with 'Gaming Laptop':")
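# regex=False treats 'Laptop' as literal text (the default has changed across pandas versions)
newdf['product'] = newdf['product'].str.replace('Laptop', 'Gaming Laptop', regex=False)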
display(newdf)
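
`.str` methods can be chained, which is handy for tidying messy text. A minimal sketch on made-up product names:

```python
import pandas as pd

# Hypothetical product names with stray whitespace and mixed case.
names = pd.Series(['  wireless mouse ', 'GAMING laptop', 'usb-c cable'])

# Trim surrounding whitespace, then normalize the case.
cleaned = names.str.strip().str.lower()

# Split on spaces and keep the first word of each name.
first_words = cleaned.str.split(' ').str[0]

# Boolean mask for names that start with a given prefix.
starts_with_usb = cleaned.str.startswith('usb')
print(cleaned.tolist())
```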

### Creating New Columns

We can create new columns using data from other columns. Let's create a column called 'total cost', calculated by multiplying price and quantity:

In [None]:
newdf['total cost'] = newdf['price']*newdf['quantity']
newdf
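
An alternative is `assign()`, which returns a new DataFrame instead of modifying the existing one. The sketch below uses a made-up `orders` DataFrame; the `total_cost` and `is_bulk` column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({
    'price': [1200.0, 25.0, 75.0],
    'quantity': [1, 3, 2],
})

# assign returns a copy with the extra column; orders itself is unchanged.
with_totals = orders.assign(total_cost=orders['price'] * orders['quantity'])

# A comparison produces a boolean column.
with_totals['is_bulk'] = with_totals['quantity'] > 1
print(with_totals)
```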

## Grouping & Combining Data

### GroupBy

We can group rows that share a value using the `groupby()` method, then use aggregation functions like `mean`, `max`, `min`, `count`, or `sum` to compute one summary value per group.

Let's see how many units of each product were sold by grouping by the product column and summing the quantity:

In [None]:
display(newdf.groupby('product')['quantity'].sum())
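
Several aggregations can be computed in one pass with named aggregation. This is a sketch on a made-up `sales` DataFrame; the output column names (`units_sold`, `average_price`, `order_count`) are illustrative.

```python
import pandas as pd

# Hypothetical data for illustration.
sales = pd.DataFrame({
    'product': ['Mouse', 'Laptop', 'Mouse', 'Laptop', 'Keyboard'],
    'quantity': [3, 1, 2, 1, 4],
    'price': [25.0, 1200.0, 25.0, 1100.0, 75.0],
})

# Each keyword is an output column: (input column, aggregation function).
summary = sales.groupby('product').agg(
    units_sold=('quantity', 'sum'),
    average_price=('price', 'mean'),
    order_count=('quantity', 'count'),
)
print(summary)
```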

### Pivot Tables

Pivot tables are useful for summarizing and reorganizing data. They allow you to aggregate data based on one or more columns.

Let's create a pivot table to show the total sales of each product:

In [None]:
pivot_table = newdf.pivot_table(index='product', values='total cost', aggfunc='sum')
print("Pivot table showing total cost per product:")
display(pivot_table)
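
Adding a `columns` argument spreads a second category across the table. The sketch below assumes a hypothetical `region` column, which is not part of the dataset used in this notebook, and uses `fill_value=0` for combinations that have no rows.

```python
import pandas as pd

# Hypothetical data; 'region' is an invented column for illustration.
sales = pd.DataFrame({
    'product': ['Mouse', 'Laptop', 'Mouse', 'Laptop', 'Keyboard'],
    'region': ['north', 'north', 'south', 'south', 'north'],
    'total cost': [75.0, 1200.0, 50.0, 2200.0, 300.0],
})

# One row per product, one column per region, cells hold the summed cost.
by_region = sales.pivot_table(
    index='product',
    columns='region',
    values='total cost',
    aggfunc='sum',
    fill_value=0,
)
print(by_region)
```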

## Quick Visualization with Pandas

We can visualize datasets in Pandas using the `plot()` method, which by default draws with Matplotlib. The `kind` argument selects the chart type, such as line graphs, bar graphs, histograms, or scatter plots.

In [None]:
pivot_table.columns

In [None]:
pivot_table.plot(y='total cost', kind='bar')

Let's group by product and calculate the sum of quantity and total cost:

In [None]:
product_summary = newdf.groupby('product').agg({
    'quantity': 'sum',
    'total cost': 'sum'
}).reset_index()

display(product_summary)

### Using Matplotlib

We can use libraries like Matplotlib and Seaborn to create more advanced visualizations. Let's create a scatter plot of the quantity sold per product (labeled 'Product Count') against total cost:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(product_summary['quantity'], product_summary['total cost'])
plt.xlabel('Product Count')
plt.ylabel('Total Cost')
plt.title('Product Count vs Total Cost')
plt.show()