Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email [rdb104@case.edu](mailto:rdb104@case.edu).<br />
___

# Exploratory Data Analysis in Python

**Description:** This lesson introduces the basic data import and simple assessment processes using the `pandas` library for Python.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 15 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** csv files

**Data Format:** `csv`, `py` 

**Libraries Used:** `pandas` 
___

## Introduction
Welcome to your first EDA lesson.  This project will use the `pandas` package to introduce the basic EDA workflow.

We will be using [1994 Adult Census Data](https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv) as our example for this tutorial.  The csv file is taken from this [Kaggle Dataset](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download).  

In this project you will:
1. Learn how to import the necessary Python libraries.
2. Use the `read_csv` function in pandas to import a csv file.
3. Discover the size of the dataset.
4. Discover the types of data in the dataset. 
5. Learn how to modify data types to improve performance, accuracy and scalability.



## Ultra Quick Jupyter Notebook Tips

These Jupyter Notebooks have markdown cells, which you just read, and code cells, which you can run and edit.  

Code cells appear in a light grey box.  You can click on the text in a code cell to edit it.  You can run a code cell by clicking the Run button at the top of the page (pictured) or by pressing Shift + Enter after clicking on the cell.

![The Run Button](img/runcellbutton.png) 

Finally, all the code cells in a notebook must be run in order.  Make sure you start at the top of each lesson and run each cell in order.

### Importing data

We'll be using the `pandas` package for most of our exploratory data analysis. Before we can use a Python package, we need to import it. We’ll import `pandas as pd`, allowing us to use the shorter `pd` instead of typing `pandas` each time we call its functions. While this may seem like a small change, it's a common practice in Python, so it's helpful to be familiar with it.

Run the cell below to import the pandas package.

In [None]:
import pandas as pd #https://pandas.pydata.org/docs/

With the `pandas` package imported, we can use its powerful functions to bring data into our workspace. Our data is in a CSV file, a widely supported format that's easy for humans to read and well suited to data analysis and maintenance. CSV files also allow for efficient storage of large datasets. You can read more about CSV files [here](https://en.wikipedia.org/wiki/Comma-separated_values) if you are curious.

Our data can be found at [this URL](https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv).  This dataset, originally from the 1994 U.S. Census, was compiled by Ronny Kohavi and Barry Becker and is hosted on [Kaggle](https://www.kaggle.com/datasets/uciml/adult-census-income/data).  

To import the CSV file, we'll use the `pd.read_csv` function. This function simply requires the file's location, which in our case is the URL above. You can also use `read_csv` to load data from a file path on your computer. For more details on this function, see the official documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [None]:
# URL of the CSV file
url = 'https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv'

# Read the CSV file into a DataFrame
adultCensus = pd.read_csv(url)
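
As noted above, `read_csv` is not limited to URLs. The sketch below reads from a file-like object instead, which behaves the same way as a file path on your computer. The three-row sample and the names `csv_text` and `sample` are made up for illustration and are not taken from `adult.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample in CSV form; it is NOT taken from adult.csv.
csv_text = """age,workclass,income
39,Private,<=50K
50,Self-emp,>50K
38,Private,<=50K
"""

# read_csv accepts a URL, a local file path, or a file-like object.
# io.StringIO lets the text above stand in for a file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.head(2))  # first two rows
```

To read a file saved on your own computer, you would pass its path as a string in place of `io.StringIO(csv_text)`.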

In `pandas`, the DataFrame methods `head()`, `info()`, and `describe()` are frequently used for preliminary data exploration and analysis. 

Here's a quick breakdown of the usage and purpose of each one:

1. head()

    Purpose: Displays the first few rows of a DataFrame or Series.
    Usage: DataFrame.head(n)
    Parameter: n (optional) - the number of rows to display (default is 5).

2. info()

    Purpose: Provides a concise summary of a DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.  Helps in understanding the structure of the DataFrame, checking for null values, and confirming data types.
    Usage: DataFrame.info()

3. describe()

    Purpose: Generates descriptive statistics of numerical columns in a DataFrame, including count, mean, standard deviation, min, 25th percentile, median (50%), 75th percentile, and max.  Useful for quickly getting insights into the distribution and summary statistics of the data.
    Usage: DataFrame.describe()

Try running all three functions in the code cells below.  What did you learn about our dataset after using these functions?

In [None]:
# Display the first five rows of the DataFrame
adultCensus.head()

In [None]:
# Generate summary statistics for the numeric columns
adultCensus.describe()

In [None]:
# Print a concise summary: column names, non-null counts, and data types
adultCensus.info()
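
If the output for the full dataset is hard to take in at once, it may help to try the same methods on a tiny table first. The sketch below uses a made-up four-row DataFrame named `small` (a hypothetical stand-in, not the real census data) to show the `n` argument of `head()` and the fact that `describe()` only summarises numeric columns by default.

```python
import pandas as pd

# Hypothetical stand-in for adultCensus with one text column and two numeric ones.
small = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "sex": ["Male", "Male", "Female", "Male"],
    "hours.per.week": [40, 13, 40, 45],
})

# head(n) returns the first n rows instead of the default 5.
first_two = small.head(2)
print(first_two)

# describe() summarises numeric columns only by default, so "sex" is left out.
summary = small.describe()
print(summary)
```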

### Types of Data

After running the `info()` function, you'll see that our dataset has 15 columns and 32,561 rows.

The output of `info()` also shows the data type (Dtype) of each column. Understanding these data types is essential for data cleaning, transformation, and optimizing operations in pandas. A pandas DataFrame can contain various data types in its columns, and knowing these types helps with effective data manipulation.

While working with smaller datasets or practicing exploratory data analysis (EDA) and visualization, optimization may not be crucial. However, converting columns to appropriate data types is a best practice that offers several advantages: lower memory use, faster operations, and better data integrity. For example, many `pandas` operations are optimized for specific data types, and the `category` type is more efficient for string-based columns with a limited set of repeated values, improving speed and reducing memory usage.  

Using appropriate data types is a habit that pays off when working with large or complex datasets. Adopting it now will make your code more useful when you move on to larger projects, and it will help others who build on your work.

So to get started with data types, let's run the code below to get another look at the data types in our data.

In [None]:
# Show the data type of each column in the adultCensus dataframe 
adultCensus.dtypes

You can see that we have only two data types: `object`, a data type for strings of characters or mixed data, and `int64`, a numeric data type for whole numbers.  

Several of our columns have the `object` data type.  Because `object` is commonly used as a catch-all, seeing so many columns with this data type should raise a red flag.  

If we're going to change data types, we should know what our options are.  Here’s an overview of the main data types you might encounter in a DataFrame.  There are a few more, but these are the most common.

1. Numeric Types

    int64 / int32: Integer data types, used for columns with whole numbers.
    Example: 1, 10, -42
    float64 / float32: Floating-point data types, used for columns with decimal numbers.
    Example: 3.14, -0.001, 2.5

2. Object Type

    object: A catch-all data type, typically used for columns that contain strings or mixed data types.
    Example: 'apple', '123 Main St', 'true'
    Note: This type can include any Python object and is commonly used when data contains text or heterogeneous types. It is generally less efficient than specific types like string.

3. String Type

    string: Explicit type for columns with text data. While object can hold strings, using string ensures more consistent handling.
    Example: 'Hello, world!', 'data science'
    Note: This type is useful for text processing and can be more optimized than using object for string data.

4. Boolean Type

    bool: Represents columns with True or False values.
    Example: True, False, True
    Use Case: Commonly used in conditions, filters, and logical operations.

5. Datetime Type

    datetime64: Used for columns containing date and time information.
    Example: 2023-11-04 15:45:00, 2024-01-01
    Use Case: Facilitates operations like date arithmetic, filtering by dates, and time-series analysis.

6. Timedelta Type

    timedelta64: Represents differences or durations between dates and times.
    Example: 3 days 00:00:00, -1 days +23:00:00
    Use Case: Useful for measuring time spans or differences between date columns.

7. Category Type

    category: Represents data that is limited to a fixed number of distinct values, similar to enumerations.
    Example: ['Red', 'Blue', 'Green']
    Use Case: Optimizes memory usage and performance when working with columns that have repeated discrete values, like status labels or ratings.
    Advantage: Reduces memory usage compared to object and speeds up operations involving comparisons.
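
To make the list above more concrete, here is a minimal sketch that builds one small `Series` for each data type and prints the type `pandas` reports for it. All values are illustrative, and the dictionary name `examples` is just a convenience for this sketch. The exact label printed for dates and durations can vary slightly between `pandas` versions.

```python
import pandas as pd

# One small Series per data type from the list above (all values are illustrative).
examples = {
    "int64": pd.Series([1, 10, -42], dtype="int64"),
    "float64": pd.Series([3.14, -0.001, 2.5], dtype="float64"),
    "object": pd.Series(["apple", 123, True], dtype="object"),  # mixed values
    "string": pd.Series(["Hello, world!", "data science"], dtype="string"),
    "bool": pd.Series([True, False, True]),
    "datetime64": pd.to_datetime(pd.Series(["2023-11-04", "2024-01-01"])),
    "category": pd.Series(["Red", "Blue", "Red"], dtype="category"),
}

# Subtracting two datetime Series produces a timedelta Series.
examples["timedelta64"] = examples["datetime64"] - examples["datetime64"].shift(1)

for label, series in examples.items():
    print(f"{label:12} -> {series.dtype}")
```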


### Exploration, doing some counting...

So, we've got many columns in our data frame with the data type `object`.  Many of these look like they would be better classified as the `category` data type, which is meant for data with a fixed number of possible values.  `marital.status`, `race`, and `sex` seem like good candidates.  `marital.status` might contain values like "Married," "Single," "Divorced," etc. `race` could include values like "White," "Black," "Asian," etc. `sex` typically has values like "Male" and "Female."

One way to check whether these fields are good candidates for the `category` data type is to count the number of unique values in each.  `pandas` has a method specifically for this aspect of EDA, `nunique()`.  

Run the code cell below to get a count of unique values for each column.

In [None]:
adultCensus.nunique()

So we can see that we were right about the `marital.status`, `race`, and `sex` columns in our dataframe.  Out of ~32k rows, there are fewer than 10 unique values in each of the three columns.  These are definitely good candidates for the `category` data type.  

However! We've also discovered that `workclass`, `education`, `occupation`, `relationship`, and `income` have very few unique values as well.  `income` is the one that is really surprising here.  Do you agree?  Why are there only 2 unique values?

Let's use the `head()` method from `pandas` again to take another look at the data.  By default, it shows the top 5 rows, but you can specify the number of rows you’d like to see by passing an integer as an argument to `head()`.  The code cell below has the argument 30 in the parentheses, so we will see the first 30 rows.  

Run the code cell below.  What do you see in the `income` field?

In [None]:
adultCensus.head(30)

We see that the `income` field is just an indicator that tells us whether or not the individual makes more than fifty thousand dollars.  So this field is also a perfect candidate for the `category` data type. 
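
Scanning rows by eye works for 30 rows, but a count is often quicker. As a minimal sketch, the example below uses a made-up `income` Series (hypothetical values written in the same style as the real column) to show how `unique()` lists the distinct labels and `value_counts()` counts the rows for each one.

```python
import pandas as pd

# Hypothetical income labels in the same style as the adultCensus column.
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K", ">50K"], name="income")

# unique() lists the distinct labels; value_counts() counts rows per label.
labels = income.unique()
counts = income.value_counts()

print(labels)
print(counts)
```

Assuming the lesson's DataFrame is loaded, the same idea would be `adultCensus['income'].value_counts()`.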

Before we convert these fields of our dataframe to the new datatype, let's review why this is important.

Memory Efficiency: The category type stores unique values only once and references them with an integer, which is more efficient than storing strings. This reduces memory usage, which is especially beneficial for large datasets with repeated values in these columns.

Performance Improvement: Operations like grouping, filtering, or comparisons can be faster with categorical data because pandas can optimize them based on the limited number of unique values.

Logical Representation: By converting these columns to category types, you indicate that these fields represent qualitative, not quantitative, information. This makes the code more readable and the data structure clearer.
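
The memory claim above can be checked on a small synthetic column. This is only a sketch: the labels are invented, the names `as_object` and `as_category` are chosen for this example, and the exact byte counts will differ by machine and `pandas` version, so none are shown here.

```python
import pandas as pd

# Synthetic column: three labels repeated many times (not the real census data).
labels = ["Married", "Single", "Divorced"] * 10_000

as_object = pd.Series(labels, dtype="object")   # one Python string per row
as_category = as_object.astype("category")      # small integer code per row

# deep=True counts the memory used by the strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```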

In [None]:
# Convert the marital.status column from object to the category data type
adultCensus['marital.status'] = adultCensus['marital.status'].astype('category')
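
The same conversion can be applied to several columns at once. The sketch below is one possible approach, shown on a made-up three-row DataFrame called `census_sample` (a hypothetical stand-in for the real data) with a hand-picked list of columns. It relies on `astype` accepting a mapping of column names to data types.

```python
import pandas as pd

# Hypothetical stand-in for adultCensus with a few of the candidate columns.
census_sample = pd.DataFrame({
    "age": [39, 50, 38],
    "marital.status": ["Never-married", "Married-civ-spouse", "Divorced"],
    "race": ["White", "White", "Black"],
    "sex": ["Male", "Male", "Female"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Columns judged to be categorical from their low unique-value counts.
category_columns = ["marital.status", "race", "sex", "income"]

# astype accepts a {column: dtype} mapping, so all columns convert in one call.
census_sample = census_sample.astype({col: "category" for col in category_columns})

print(census_sample.dtypes)
```

Columns left out of the mapping, such as `age`, keep their original data type.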
