Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email [rdb104@case.edu](mailto:rdb104@case.edu).<br />
___

# Exploratory Data Analysis in Python

**Description:** This lesson introduces the basic data import and simple assessment processes using the `pandas` library for Python.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 30 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** csv files

**Data Format:** `csv`, `py` 

**Libraries Used:** `pandas` 
___

## Introduction
Welcome to your first exploratory data analysis (EDA) lesson.  This project will use the `pandas` package to introduce the basic EDA workflow.

We will be using [1994 Adult Census Data](https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv) as our example for this tutorial.  The CSV file is taken from this [Kaggle Dataset](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download).  

In this project you will:
1. Learn how to import the necessary Python libraries.
2. Use the `read_csv` function in pandas to import a csv file.
3. Discover the size of the dataset.
4. Discover the types of data in the dataset. 
5. Learn how to modify data types to improve performance, accuracy and scalability.



## Ultra Quick Jupyter Notebook Tips

These Jupyter Notebooks have markdown cells, which you just read, and code cells, which you can run and edit.  

Code cells appear in a light grey box.  You can click on the text in a code cell to edit it.  You can run a code cell by clicking the Run button at the top of the page (pictured) or by pressing Shift + Enter after clicking on the cell.

![The Run Button](img/runcellbutton.png) 

Finally, all the code cells in a notebook must be run in order.  Make sure you start at the top of each lesson and run each cell in order.

### Importing data

We'll be using the `pandas` package for most of our exploratory data analysis. Before we can use a Python package, we need to import it. We’ll import `pandas as pd`, allowing us to use the shorter `pd` instead of typing `pandas` each time we call its functions. While this may seem like a small change, it's a common practice in Python, so it's helpful to be familiar with it.

Run the cell below to import the pandas package.

In [None]:
import pandas as pd  # Documentation: https://pandas.pydata.org/docs/

With the `pandas` package imported, we can use its functions to bring data into our workspace. Our data is in a CSV (comma-separated values) file, a widely supported plain-text format that is easy for humans to read and well suited to data analysis and maintenance. You can read more about CSV files [here](https://en.wikipedia.org/wiki/Comma-separated_values) if you are curious.

Our data can be found at [this URL](https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv). This dataset was extracted from the 1994 U.S. Census database by Ronny Kohavi and Barry Becker and is hosted on [Kaggle](https://www.kaggle.com/datasets/uciml/adult-census-income/data).  

To import the CSV file, we'll use the `pd.read_csv` function. This function simply requires the file's location, which in our case is the URL above. You can also use `read_csv` to load data from a file path on your computer. For more details on this function, see the official documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [None]:
# URL of the CSV file
url = 'https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv'

# Read the CSV file into a DataFrame
adultCensus = pd.read_csv(url)
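
Because `read_csv` accepts a file path or a file-like object as well as a URL, the same call can be tried without a network connection. The sketch below is a minimal illustration rather than part of the original lesson: the three-row sample text and its column names are made up, and `io.StringIO` stands in for a file on your computer. It also shows the `na_values` parameter, on the assumption that a file marks missing entries with a placeholder such as `?`.

```python
import io

import pandas as pd

# Hypothetical sample data; these rows are invented for illustration only.
sample_csv = """age,workclass,hours_per_week
39,Private,40
50,?,13
38,Private,40
"""

# StringIO wraps the text in a file-like object, which read_csv accepts
# in place of a URL or a local path such as 'data/adult.csv'.
sample = pd.read_csv(io.StringIO(sample_csv))

# na_values tells read_csv which placeholder strings to treat as missing.
sample_clean = pd.read_csv(io.StringIO(sample_csv), na_values="?")

print(sample.shape)                            # (rows, columns)
print(sample_clean["workclass"].isna().sum())  # number of missing values
```

Without `na_values`, the `?` entry is read as an ordinary string, so it would not be counted as missing.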

In pandas, the DataFrame methods `head()`, `info()`, and `describe()` are frequently used for preliminary data exploration and analysis. 

Here's a quick breakdown of the usage and purpose of each one:

1. `head()`

    - Purpose: Displays the first few rows of a DataFrame or Series.
    - Usage: `DataFrame.head(n)`
    - Parameter: `n` (optional), the number of rows to display (default is 5).

2. `info()`

    - Purpose: Provides a concise summary of a DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.  Helps in understanding the structure of the DataFrame, checking for null values, and confirming data types.
    - Usage: `DataFrame.info()`

3. `describe()`

    - Purpose: Generates descriptive statistics of numerical columns in a DataFrame, including count, mean, standard deviation, min, 25th percentile, median (50%), 75th percentile, and max.  Useful for quickly getting insights into the distribution and summary statistics of the data.
    - Usage: `DataFrame.describe()`
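
The three methods above can be sketched on a small, made-up DataFrame before trying them on the census data. The `people` DataFrame and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data for illustration; not taken from the census file.
people = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29],
        "education": ["HS-grad", "Bachelors", "Masters",
                      "HS-grad", "Bachelors", "Doctorate"],
        "hours_per_week": [40, 45, 50, 38, 40, 60],
    }
)

# head(n) returns the first n rows (5 if n is omitted).
first_two = people.head(2)
print(first_two)

# info() prints its summary directly and returns None.
people.info()

# describe() returns a new DataFrame of statistics for the numeric columns.
stats = people.describe()
print(stats)
```

Note that `education` does not appear in the `describe()` output, because by default only numeric columns are summarized.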

Try running all three functions in the code cells below.  What did you learn about our dataset after using these functions?

In [None]:
# Display the first five rows of the DataFrame
adultCensus.head()

In [None]:
# Display summary statistics for the numerical columns in the DataFrame
adultCensus.describe()
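
By default, `describe()` skips text columns. As an aside that goes slightly beyond this lesson, passing `include="all"` asks pandas to summarize every column, adding the number of unique values, the most frequent value (`top`), and how often it occurs (`freq`) for non-numeric columns. The `jobs` DataFrame below is invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
jobs = pd.DataFrame(
    {
        "workclass": ["Private", "Private", "Self-emp", "Private"],
        "age": [30, 41, 25, 52],
    }
)

# include="all" summarizes numeric and non-numeric columns together.
# Statistics that do not apply to a column are shown as NaN.
all_stats = jobs.describe(include="all")
print(all_stats)
```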

In [None]:
# Display a concise summary of the DataFrame: column names, data types,
# non-null counts, and memory usage
adultCensus.info()
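
The counts and types that `info()` prints can also be read as values, which is handy when you want to use them in later code. This is a minimal sketch on a made-up DataFrame (`toy` and its columns are hypothetical), using the `shape` and `dtypes` attributes and the `memory_usage` method.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "sex": ["Male", "Female", "Female", "Male"],
    }
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# dtypes is a Series mapping each column name to its data type.
print(toy.dtypes)

# memory_usage(deep=True) reports bytes per column, including string contents.
print(toy.memory_usage(deep=True))
```

The same attributes should work on `adultCensus` once the cells above have been run.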