Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email [rdb104@case.edu](mailto:rdb104@case.edu).<br />
___

# Exploratory Data Analysis in Python

**Description:** This lesson introduces the basic data import and simple assessment processes using the `pandas` library for Python.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 15 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** csv files

**Data Format:** `csv`, `py` 

**Libraries Used:** `pandas` 
___

## Introduction
Welcome to your first exploratory data analysis project.  This project will use the `pandas` package to introduce the basic workflow for importing a dataset and taking a first look at its contents.

We will be using [1994 Adult Census Data](https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv) as our example for this tutorial.  The csv file is taken from this [Kaggle Dataset](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download).  

In this project you will:
1. Learn how to import the necessary Python libraries.
2. Use the `read_csv` function in pandas to import a csv file.
3. Discover the size of the dataset.
4. Discover the types of data in the dataset. 



## Ultra Quick Jupyter Notebook Tips

These Jupyter Notebooks have markdown cells, which you just read, and code cells, which you can run and edit.  

Code cells appear in a light grey box.  You can click on the text in a code cell to edit it.  You can run a code cell by clicking the Run button at the top of the page (pictured) or by pressing Shift + Enter after clicking on the cell.

![The Run Button](img/runcellbutton.png) 

Finally, all the code cells in a notebook must be run in order.  Make sure you start at the top of each lesson and run each cell in order.

### Importing data

We're going to be using the `pandas` package for most of our exploratory data analysis work.  Before we can begin using a Python package, we have to import it.  We will import `pandas as pd` so that when we call functions from the package in our code we can just use the shorter `pd` instead of typing out `pandas` every time.  This may not seem like a big difference, but it is a common practice, so it is a good idea to be aware of it in case you see code in the future using this sort of shorthand for package names.  

Run the cell below to import the `pandas` package.  

In [None]:
import pandas as pd #https://pandas.pydata.org/docs/

Now that the `pandas` package has been imported we can use the various excellent functions for importing data that are built into the package. Our data is a csv file.  CSV is a very common data format that is widely supported. It is easy for humans to read, making data analysis and maintenance much simpler.  It can also store large amounts of data in relatively small file sizes.  You can learn more about csv files [here](https://en.wikipedia.org/wiki/Comma-separated_values) if you are curious.

Our data is stored here: <https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv>.  This is a set of data extracted from the 1994 census by Ronny Kohavi and Barry Becker and posted on [Kaggle](https://www.kaggle.com/datasets/uciml/adult-census-income/data).  

To import our csv file we are going to use the `pd.read_csv` function that is built into `pandas`.  You can learn more about this function [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).  But right now, all we need to do is call the function and give it the one required parameter, the location of a file. In this case, the file location is a URL.  `read_csv` will also work with a path to a file on your computer.

In [None]:
# URL of the CSV file
url = 'https://raw.githubusercontent.com/LibraryBeales/Exploratory_Data_Analysis/refs/heads/main/adult.csv'

# Read the CSV file into a DataFrame
adultCensus = pd.read_csv(url)
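
As a small illustration of the point that `read_csv` is not limited to URLs, here is a minimal sketch that reads CSV text held in memory.  The three-column sample and the names `csv_text` and `sample` are made up for this example and are not part of the census file.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for illustration
csv_text = "age,workclass,income\n39,Private,<=50K\n50,Self-emp,>50K\n"

# read_csv accepts a URL, a local file path, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
```

The first line of the text becomes the column names, and each following line becomes one row of the DataFrame.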



In pandas, the DataFrame methods `head()`, `info()`, and `describe()` are frequently used for preliminary data exploration and analysis. 

Here's a quick breakdown of the usage and purpose of each one:

1. `head()`

    * Purpose: Displays the first few rows of a DataFrame or Series.
    * Usage: `DataFrame.head(n)`
    * Parameter: `n` (optional) - the number of rows to display (default is 5).

2. `info()`

    * Purpose: Provides a concise summary of a DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.  Helps in understanding the structure of the DataFrame, checking for null values, and confirming data types.
    * Usage: `DataFrame.info()`

3. `describe()`

    * Purpose: Generates descriptive statistics of numerical columns in a DataFrame, including count, mean, standard deviation, min, 25th percentile, median (50%), 75th percentile, and max.  Useful for quickly getting insights into the distribution and summary statistics of the data.
    * Usage: `DataFrame.describe()`

Try running all three functions in the code cells below.  What did you learn about our dataset after using these functions?

In [8]:
adultCensus.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [7]:
# Summarize the columns, non-null counts, and data types of the DataFrame
adultCensus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
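
Notice that `info()` reports no null values, yet the `head()` output contains `?` in the `workclass` and `occupation` columns.  A reasonable guess, and it is only an assumption here, is that `?` marks missing data in this file.  The sketch below uses three rows shaped like the `head()` output and the `na_values` parameter of `read_csv` to treat `?` as missing.  The names `csv_text`, `flagged`, and `missing_per_column` are made up for this example.

```python
import io
import pandas as pd

# Three rows modeled on the head() output; '?' is assumed to mean "missing"
csv_text = "age,workclass,occupation\n90,?,?\n82,Private,Exec-managerial\n66,?,?\n"

# na_values tells read_csv which strings to convert to missing values (NaN)
flagged = pd.read_csv(io.StringIO(csv_text), na_values="?")

# isna() marks missing cells; sum() counts them per column
missing_per_column = flagged.isna().sum()
print(missing_per_column)
```

If the assumption holds, the same `na_values="?"` argument could be passed when reading the full census file so that `info()` reports the gaps.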


In [9]:
# Generate descriptive statistics for the numeric columns of the DataFrame
adultCensus.describe()

| | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |
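
By default `describe()` only summarizes the numeric columns, which is why text columns such as `workclass` are absent from its output.  A minimal sketch of including every column with the `include="all"` option follows.  The small `people` DataFrame is hypothetical and stands in for the census data.

```python
import pandas as pd

# Hypothetical four-row DataFrame with one numeric and one text column
people = pd.DataFrame({
    "age": [90, 82, 66, 54],
    "workclass": ["?", "Private", "?", "Private"],
})

# include="all" adds rows such as unique, top, and freq for text columns
summary = people.describe(include="all")
print(summary)
```

Statistics that do not apply to a column, such as the mean of a text column, appear as `NaN` in the result.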


### Types of Data

After running the `info()` function, you can see that we have 15 columns and 32561 rows in our dataset.  You can also see the Dtype of each column in the output of the `info()` function.   Understanding these data types helps in data cleaning, transformation, and optimization of operations in pandas.  
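
If you only want the size or the data types, without the rest of the `info()` summary, the `shape` and `dtypes` attributes give them directly.  This is a small sketch on a hypothetical three-row DataFrame named `people`, not the census data.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
people = pd.DataFrame({
    "age": [90, 82, 66],
    "workclass": ["?", "Private", "?"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = people.shape
print(n_rows, n_cols)

# dtypes lists the data type of each column
print(people.dtypes)
```

Running `adultCensus.shape` in the notebook should agree with the row and column counts reported by `info()`.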

In [14]:
adultCensus.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education.num        16
marital.status        7
occupation           15
relationship          6
race                  5
sex                   2
capital.gain        119
capital.loss         92
hours.per.week       94
native.country       42
income                2
dtype: int64
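
`nunique()` counts how many distinct values each column holds, which helps separate categorical columns (few distinct values, like `sex` or `income`) from continuous ones (many distinct values, like `fnlwgt`).  To see what those values are and how often each occurs, `value_counts()` is a natural follow-up.  The sketch below uses a hypothetical four-row DataFrame rather than the census data.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
people = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Female"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K"],
})

# Number of distinct values in each column
distinct = people.nunique()
print(distinct)

# How many times each value appears in a single column
counts = people["sex"].value_counts()
print(counts)
```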

In [None]:
results.status_code

The response object has many other attributes as well:
* .headers - access header information about the server that sent the response
* .text - access the response body as text for text-based responses, such as html, json, yaml, etc.
* .content - access the response body as bytes for nontext requests, such as images, spreadsheets, zipfiles, etc.
* .encoding - shows what text encoding `requests` is using in the `.text` attribute
* .url - shows the url that was used in the request, can be useful when encoding urls with various parameters
* .json() - built-in JSON decoder that parses a JSON response body into Python objects (this one is a method, so it needs the parentheses)

Try viewing these different attributes by editing the code cell below and running it again to see what you get.  All the attributes follow the same format: `results`, where we stored our response object, then a period, then the attribute (i.e., `text`, `headers`, `url`).

For example, `results.text`


In [None]:
results.text

### Saving the Response Object 

Now that we have a response object and have taken a quick look at its attributes, we need to save it to a file so that we can come back to it in the future and examine it.  We could scrape the web again each time we want to do some analysis, but that would be a waste of computing resources, and the content on the web is not static.  Social media posts are deleted.  Items are removed from shops.  The format of websites changes, making your web scraper obsolete.  So it is best to scrape once and save the data locally.  Saving it also allows us to easily share the data we are using with others.

We are going to use a `with` statement to simplify the process of saving the response object.  The `with` statement in Python sets up a resource, such as an open file, and cleans it up automatically when the indented block ends.  We will open the file and write the response object to it, and the file will be closed for us when the `with` statement finishes. You can learn more about `with` statements and writing to files here: https://www.freecodecamp.org/news/with-open-in-python-with-statement-syntax-example/

You can see the structure of the `with` statement in the code cell below.  

`with` begins the with statement.  
`open` tells the computer to open a file using the filename you specify, in this case, `scrape.txt`, and the `w` indicates we want to write to the file.  
`as` creates a variable that contains the file information, in this case, `file`.  
`file.write` will save whatever text you include in the parentheses, such as `results.text` or `results.url`.  Attributes that are not plain text, such as `results.headers`, need to be converted first with `str()`.

The filename, `scrape.txt`, and the attribute can be changed to create different files with different content.  

When you run the code below you will see the file appear in the list of directories/files on the left.  

In [None]:
with open('scrape.txt','w') as file:
    file.write(results.text)
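
To see the `with` pattern on its own, here is a minimal sketch that writes a string to a file and reads it back.  The placeholder string stands in for `results.text`, and the temporary folder is an assumption made so the example does not touch your working directory.

```python
import os
import tempfile

# Placeholder text standing in for results.text
text = "placeholder text standing in for the response body"

# Build a path inside a temporary folder (hypothetical location for this example)
path = os.path.join(tempfile.mkdtemp(), "scrape.txt")

# Open the file for writing; it is closed automatically when the block ends
with open(path, "w") as file:
    file.write(text)

# Open the same file for reading and load its contents
with open(path) as file:
    saved = file.read()

print(saved == text)
```

Because the file is closed at the end of each `with` block, there is no need to call `file.close()` yourself.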

Try changing the filename below to `scrapeheaders.txt` and the attribute below to `str(results.headers)`, then run the code cell again to create another file.  (`file.write` only accepts text, so the headers are converted with `str()` first.)  Did you get a different file with different content?

In [None]:
with open('scrape.txt','w') as file:
    file.write(results.text)