Outline

- Introduction to Pandas
- Loading installed packages
- Inspect data
- Explore plot types
- Data Selection and Filtering
- Data Manipulation
- Summary of Functions






# Notebook Instructions

1. Save a copy of the forked notebook to Google Drive (File >> Save a copy in Drive).  This is the only way you'll be able to save your changes.
2. Do the Python coding as directed below (look for the "Your Turn" sections).
3. When you have completed your coding, save your notebook in Drive (File >> Save) and also save it to your GitHub repo (File >> Save a copy in GitHub).
4. Navigate to your notebook file in GitHub and copy and submit the URL for the Canvas lab assignment.

# Introduction to Pandas

Pandas is a powerful data manipulation library in Python. It provides data structures and functions that make working with structured data more convenient and efficient compared to base Python.

Pandas introduces two primary data structures: **Series** and **DataFrames**.
- A Series is a one-dimensional, labeled collection of values that share a single data type.
- A DataFrame is a two-dimensional data structure with columns of potentially different types. It is similar to a spreadsheet.
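
As a minimal sketch of the two structures, the cell below builds a Series and a DataFrame by hand. The labels (`car_a`, etc.) and the values are made up for illustration and are not part of the lab's datasets.

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype
mpg = pd.Series([21.0, 22.8, 18.1],
                index=["car_a", "car_b", "car_c"],
                name="mpg")

# A DataFrame: two-dimensional; each column can have its own dtype
cars = pd.DataFrame(
    {
        "mpg": [21.0, 22.8, 18.1],                 # floats
        "cyl": [6, 4, 6],                          # integers
        "make": ["Mazda", "Datsun", "Plymouth"],   # strings
    },
    index=["car_a", "car_b", "car_c"],
)

print(mpg.dtype)     # the single dtype of the Series
print(cars.dtypes)   # one dtype per column
print(cars.shape)    # (number of rows, number of columns)
```

Each column of a DataFrame is itself a Series, so `cars["mpg"]` gives back a Series.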

Pandas provides a wide range of functions and methods for data manipulation, cleaning, filtering, aggregation, merging, and more. It integrates well with other libraries in the data science ecosystem, such as Matplotlib for data visualization.

Some key advantages of using Pandas over base Python include:
- Efficient memory usage and performance for large datasets
- Intuitive syntax for data manipulation
- Built-in functions for common data operations (e.g., filtering, grouping, reshaping)
- Seamless integration with other data science libraries
- Handling of missing data and data alignment
- Powerful tools for data exploration and analysis

Throughout this lab, we will explore various aspects of Pandas and its capabilities for data manipulation, visualization and analysis.



# Summarize a DataFrame

## Load Libraries

In this class we will be using
- Pandas
- Matplotlib

In [None]:
import pandas as pd
import matplotlib as mpl
import statsmodels.api as sm # To get some data


## Getting Data into Pandas

In this case we will load data with the `statsmodels` library, whose `get_rdataset()` function downloads datasets from the Rdatasets collection. `mtcars` is a common practice dataset.


In [None]:
# Download mtcars from the Rdatasets collection via statsmodels
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data

# .data is already a pandas DataFrame; store it under the shorter name df
df = pd.DataFrame(mtcars)


## Preview Data

Here is a data dictionary:

|Attribute | Description |
|---------|-----------|
|mpg  |Miles/(US) gallon|
|cyl  |Number of cylinders|
|disp |Displacement (cu.in.)|
|hp   |Gross horsepower|
|drat |Rear axle ratio|
|wt   |Weight (1000 lbs)|
|qsec |1/4 mile time (seconds)|
|vs   |Engine (0 = V-shaped, 1 = straight)|
|am   |Transmission (0 = automatic, 1 = manual)|
|gear |Number of forward gears|
|carb |Number of carburetors|

In [None]:
# Look at the first five rows
df.head()


In [None]:
# Look at the last ten rows
df.tail(10)

In [None]:
# Get a statistical summary of the dataset
df.describe()
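
The summary methods can also be tried on a small hand-typed table, which makes the output easier to read. This is only a sketch: `toy` and its values are hypothetical and not part of the lab data. It shows `df.shape`, `df.info()`, and the fact that `describe()` returns a DataFrame whose row labels are the statistic names.

```python
import pandas as pd

# Hypothetical toy data, typed by hand for illustration
toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.1, 30.4],
                    "cyl": [6, 4, 6, 4]})

print(toy.shape)   # (number of rows, number of columns)
toy.info()         # column names, non-null counts, and dtypes

summary = toy.describe()           # describe() returns a DataFrame
print(summary.index.tolist())      # the statistic names are the row labels
print(summary.loc["max", "mpg"])   # look up one statistic for one column
```

Because the result of `describe()` is an ordinary DataFrame, any single value in it can be retrieved with `.loc[row_label, column_name]`.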

## Pandas Syntax

Pandas is designed to be used primarily with DataFrames. When you see syntax like `df.method()`, it means that you are calling a method on the DataFrame object `df`. The parentheses `()` after the method name are what actually call the method; any arguments go inside them.

So, above, we used `df.head()`, `df.tail(10)`, and `df.describe()`, meaning that each method is being called on the DataFrame object `df`.
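
The difference between calling a method (with parentheses) and reading an attribute (without) can be sketched as follows. The `toy` DataFrame is a hypothetical stand-in, not one of the lab's datasets.

```python
import pandas as pd

toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.1], "cyl": [6, 4, 6]})

# Attributes are read without parentheses
print(toy.shape)
print(toy.columns)

# Methods are called with parentheses; arguments go inside them
first_two = toy.head(2)
print(first_two)

# Leaving off the parentheses returns the method itself instead of running it
print(callable(toy.head))  # True
```

If a cell prints something like `<bound method ...>` instead of a table, the parentheses were probably left off.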

## Your Turn: Summarize Data with Pandas

Download insurance data.  (I think Matt P showed you this dataset.)

In [None]:
idf = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/insurance.csv")



&rarr;  Display the top and bottom rows of `idf`.


&rarr;  What is the average expense? (Hint:  use the functions above.  Later on we'll find other ways of doing the calculation.)

&rarr;  What is the median age?

In [None]:
# Your code goes here


In [None]:
# Your code goes here


In [None]:
# Your code goes here


# Visualize Data


Pandas can create a variety of plots with the `df.plot(kind = "...")` syntax, where `plot` is a method available on both DataFrames and Series.
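
As a rough sketch of how the plotting call works: `plot()` returns a Matplotlib `Axes` object, and each `kind` also has an equivalent accessor form such as `plot.hist()`. The `toy` data is hypothetical, and the `Agg` backend line is an assumption that lets the code run outside a notebook; it is not needed in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.1, 30.4]})

# Two side-by-side axes so the two calls do not draw over each other
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Form 1: plot(kind=...)
ax1 = toy["mpg"].plot(kind="hist", bins=4, ax=left, title="kind='hist'")

# Form 2: the accessor, plot.hist(...), which draws the same chart
ax2 = toy["mpg"].plot.hist(bins=4, ax=right, title="plot.hist()")

# The returned Axes can be customized further with Matplotlib methods
ax1.set_xlabel("Miles per gallon")
fig.tight_layout()
```

Keeping the returned `Axes` in a variable is optional, but it is handy when a label or title needs adjusting after the plot is drawn.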

Here are examples.

## Line Chart

In [None]:
# Line chart
df['mpg'].plot(kind = "line", color='blue')

In [None]:
# Rotate the x-axis tick labels and add axis labels
df['mpg'].plot(kind = "line",
               color='blue',
               rot = 90,
               xlabel = "Car",
               ylabel = "Miles per gallon")


## Bar Chart

In [None]:
# Bar chart
df['mpg'].plot(kind = "barh",
               color='red',
               xticks=[0,10,20])

## Histogram

In [None]:
# Histogram
df['mpg'].plot(kind = "hist",
               bins=15,
               title='Miles Per Gallon')

## Boxplot

In [None]:
# Boxplot
df.plot(kind = "box", column = "mpg")

In [None]:
# Boxplot showing distribution of mpg at levels of cyl

df.plot(kind = "box", column = "mpg", by = "cyl")

## Scatterplot

In [None]:
# Scatter plot, with points colored by weight
df.plot(kind = "scatter", x = 'mpg', y = 'hp', c = 'wt')

### Your Turn: Visualizing Relationships

&rarr;  Using the `idf` dataset, make a plot showing the distribution of `bmi` by `sex`:

In [None]:
# Your code goes here


# Data Selection and Filtering

Data selection and filtering are essential techniques for working with Pandas DataFrames. They allow you to extract specific subsets of data based on certain conditions.



## Selecting Specific Columns
Square bracket subsetting, or indexing, is a way to access specific rows and columns of a DataFrame. When you use square brackets `[]` with a DataFrame, you can pass in:
- A single column name to select a single column (returns a Series)
- A list of column names to select multiple columns (returns a DataFrame)
- A *boolean mask* to select rows based on a condition
- A slice to select a range of rows

For example, `df['mpg']` selects the 'mpg' column as a Series, while `df[['mpg', 'cyl']]` selects the 'mpg' and 'cyl' columns as a DataFrame.
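
The four uses of square brackets listed above can be sketched on a small hand-made table. The `toy` DataFrame and its labels are hypothetical, chosen only to keep the output short.

```python
import pandas as pd

toy = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.1, 30.4], "cyl": [6, 4, 6, 4]},
    index=["car_a", "car_b", "car_c", "car_d"],
)

one_col = toy["mpg"]             # single column name -> Series
two_cols = toy[["mpg", "cyl"]]   # list of column names -> DataFrame
masked = toy[toy["cyl"] == 4]    # boolean mask -> rows where the mask is True
sliced = toy[1:3]                # slice -> rows at positions 1 and 2

print(type(one_col).__name__)    # Series
print(type(two_cols).__name__)   # DataFrame
print(masked.index.tolist())     # ['car_b', 'car_d']
print(sliced.index.tolist())     # ['car_b', 'car_c']
```

Note the double brackets in `toy[["mpg", "cyl"]]`: the outer pair does the indexing and the inner pair is a Python list of column names.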



In [None]:
df['mpg']

In [None]:
df[['mpg', 'cyl']]

## Your Turn: Filter columns

&rarr; Return just the `bmi` and `expenses` columns from `idf`.


In [None]:
# Your code goes here


## Filtering Rows with Boolean Masks

Boolean masking is a powerful technique for filtering rows based on certain conditions. You create a boolean mask by applying a condition to a column of a DataFrame, which returns a Series of True/False values, one per row. You can then use this mask to select the rows where the condition is True.


In [None]:
# Here is the mask:
df['mpg'] > 25

In [None]:
# Now, filter with the mask:
df[df['mpg'] > 25]
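
Masks can also be combined. As a sketch on hypothetical toy data (not the lab's `df`), the cell below uses `&` for "and", `|` for "or", and `~` for "not". Each condition needs its own parentheses because these operators bind more tightly than comparisons such as `>`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.1, 30.4], "cyl": [6, 4, 6, 4]},
    index=["car_a", "car_b", "car_c", "car_d"],
)

# AND: both conditions must be True
both = toy[(toy["mpg"] > 20) & (toy["cyl"] == 4)]

# OR: at least one condition must be True
either = toy[(toy["mpg"] > 25) | (toy["cyl"] == 6)]

# NOT: invert a mask
not_four = toy[~(toy["cyl"] == 4)]

print(both.index.tolist())      # ['car_b', 'car_d']
print(either.index.tolist())    # ['car_a', 'car_c', 'car_d']
print(not_four.index.tolist())  # ['car_a', 'car_c']
```

Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why the symbol operators are used here.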





# Data Manipulation

Data manipulation involves modifying and transforming the data within a DataFrame. Pandas provides various functions and methods for data manipulation tasks.




## Creating New Columns
You can create new columns in a DataFrame by assigning values to a new column name. The values can be copied from existing columns or derived from calculations on them.

In [None]:
# Creating new columns based on existing columns
df['mpg_per_cyl'] = df['mpg'] / df['cyl']
df[['mpg', 'cyl', 'mpg_per_cyl']].head()

In this example, a new column 'mpg_per_cyl' is created by dividing the values of 'mpg' by 'cyl'.
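
A related option is `DataFrame.assign()`, which returns a new DataFrame with the extra column and leaves the original unchanged. The sketch below contrasts the two approaches; the `toy` data and the column names `wt_lbs` and `hp_per_1000lbs` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"wt": [2.62, 2.32, 3.46], "hp": [110, 93, 105]})

# assign() returns a copy with the new column; toy itself is not modified
with_lbs = toy.assign(wt_lbs=toy["wt"] * 1000)
print("wt_lbs" in toy.columns)       # False
print("wt_lbs" in with_lbs.columns)  # True

# Bracket assignment modifies toy in place
toy["hp_per_1000lbs"] = toy["hp"] / toy["wt"]
print(toy.columns.tolist())
```

Bracket assignment is the shorter form used in this lab; `assign()` is convenient when the original DataFrame should stay as it was.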




## Your Turn: Create an Additional Chart

- Create a chart showing the distribution of km/gal by transmission type (note: 1 mile ≈ 1.60934 kilometers).


In [None]:
# Your code goes here


# Summary of Functions

Throughout this lab, we introduced several functions and methods provided by Pandas. Here's a summary of the key functions covered:

- `pd.DataFrame()`: Creates a DataFrame from a dictionary or other data source
- `pd.read_csv()`: Reads a CSV file (from a local path or a URL) into a DataFrame
- `df.head()`: Displays the first few rows of a DataFrame
- `df.tail()`: Displays the last few rows of a DataFrame
- `df.info()`: Provides a concise summary of a DataFrame
- `df.describe()`: Generates descriptive statistics of a DataFrame
- `df.plot(kind = "line")`: Creates a line chart
- `df.plot(kind = "barh")`: Creates a horizontal bar chart
- `df.plot(kind = "hist")`: Creates a histogram
- `df.plot(kind = "box")`: Creates a box plot
- `df.plot(kind = "scatter")`: Creates a scatter plot

