# Exploring Data in Pandas

> Pandas `DataFrame`s have a wide range of tools that allow you to explore and summarise your dataset. In this lesson, we will look at how you can use Pandas to examine the structure of your data. This is a crucial part of any data analysis, and is often the first step in Exploratory Data Analysis (EDA). It can reveal issues that you might subsequently need to address by cleaning or transformation.

## Loading in an Example Dataset

Let's begin by loading in an example dataset to work with. Run the cell below to import the data.

In [None]:
import pandas as pd
salary_df = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv')


## Viewing Data

>When you have read in a large dataset, it can be useful to look at a small subset of the data to learn something about its structure: the number of columns, the kind of information they contain, and whether or not the `DataFrame` loaded as you expected it to.

After loading in your data, it is typically a good idea to begin by viewing the first few rows using the `.head()` method.

In [None]:
salary_df.head(4)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,


You can also view the last few rows using the `.tail()` method:

In [None]:
salary_df.tail(2)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
674,ZY,ELIZABETH AGUILAR-TARCHI,"HEAD ATTORNEY, CIVIL AND CRIMINAL",176856.17,0.0,3537.11,,180393.28,180393.28,2011,,San Francisco,
675,ZZ,JULIAN NG,SERGEANT III (POLICE DEPARTMENT),130457.76,43793.11,6061.8,,180312.67,180312.67,2011,,San Francisco,
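
Both methods take an optional row count. As a minimal sketch, assuming a small made-up `DataFrame` (`demo_df` and its `value` column are hypothetical and not part of the salary dataset), the block below shows the default of five rows and the effect of a negative count:

```python
import pandas as pd

# Hypothetical stand-in data: eight rows with a single numeric column
demo_df = pd.DataFrame({"value": range(10, 18)})

first_rows = demo_df.head()          # no argument: the first 5 rows
last_rows = demo_df.tail(3)          # the last 3 rows
all_but_last_two = demo_df.head(-2)  # negative n: every row except the last 2

print(len(first_rows), len(last_rows), len(all_but_last_two))
```

On a large dataset, checking both ends like this is a quick way to spot problems such as a stray header or footer row that was read in as data.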


## Describing the Data

> Once you have checked the overall structure of the data, another useful step is to get some descriptive information about the columns in the `DataFrame`. Viewing all of the data in its entirety can be confusing and daunting, whereas descriptive statistics about the data are much easier to parse and understand. Pandas has a number of tools for this purpose, including `.info()`, `.describe()` and `.value_counts()`.


### Finding the Shape of the `DataFrame`

You can find out the shape of a `DataFrame` (the number of rows and columns) using the `.shape` attribute:

In [None]:
shape = salary_df.shape
print(f'This dataset has {shape[0]} rows and {shape[1]} columns')

This dataset has 676 rows and 13 columns
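
Because `.shape` is a plain tuple, it can be unpacked directly into two variables. The sketch below uses a small hypothetical `DataFrame` (`demo_df`, with made-up `name` and `pay` columns) to show this alongside two related checks, `len()` and `.columns`:

```python
import pandas as pd

# Hypothetical stand-in data, not the salary dataset
demo_df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "pay": [1.0, 2.0, 3.0],
})

n_rows, n_cols = demo_df.shape        # unpack the (rows, columns) tuple
column_names = list(demo_df.columns)  # the column labels, in order

print(f"{n_rows} rows, {n_cols} columns: {column_names}")
print(len(demo_df))                   # len() gives the number of rows
```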


### The `.info()` Method
This method is useful because it will tell you the following information:

- The number of rows in your dataset
- The amount of memory your `DataFrame` object uses
- The data type and number of non-null values in each column

We will learn more about data types and handling null values in a later lesson. For now, a quick teaser question: based on the output of the `.info()` call in the code block below, are there any columns in this dataset that you would consider removing from the analysis?

In [None]:
salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676 entries, 0 to 675
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                675 non-null    object 
 1   EmployeeName      676 non-null    object 
 2   JobTitle          676 non-null    object 
 3   BasePay           676 non-null    float64
 4   OvertimePay       676 non-null    float64
 5   OtherPay          676 non-null    float64
 6   Benefits          0 non-null      float64
 7   TotalPay          676 non-null    float64
 8   TotalPayBenefits  676 non-null    float64
 9   Year              676 non-null    int64  
 10  Notes             0 non-null      float64
 11  Agency            676 non-null    object 
 12  Status            0 non-null      float64
dtypes: float64(8), int64(1), object(4)
memory usage: 68.8+ KB
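
The `.info()` method prints its report rather than returning it, so if you want the non-null counts as values you can work with, one option is to compute them with `.count()` or `.isna().sum()`. This is a rough sketch on a hypothetical `DataFrame` (`demo_df` and its columns are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with some missing values
demo_df = pd.DataFrame({
    "name": ["a", "b", None],
    "pay": [1.0, 2.0, 3.0],
    "notes": [np.nan, np.nan, np.nan],
})

non_null_counts = demo_df.count()    # non-null values per column
null_counts = demo_df.isna().sum()   # missing values per column

print(non_null_counts)
print(null_counts)
```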


### The `.describe()` Method
> The `.describe()` method provides a quick statistical summary of a `DataFrame`. It's useful for getting an overview of numerical columns, including the count, mean, standard deviation, minimum and maximum values, and quartiles (the 25th, 50th and 75th percentiles).

Run the code block below to see an example. Note that, when called without arguments on a `DataFrame` like this one, it only returns those columns with a numeric data type.

In [None]:
salary_df.describe()

Unnamed: 0,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Status
count,676.0,676.0,676.0,0.0,676.0,676.0,676.0,0.0,0.0
mean,149900.019867,30577.231657,23448.147426,,203925.39895,203925.39895,2011.0,,
std,41837.73621,35124.001536,30720.629073,,31798.561906,31798.561906,0.0,,
min,25400.0,0.0,0.0,,180312.67,180312.67,2011.0,,
25%,117268.875,0.0,7006.3175,,185695.47,185695.47,2011.0,,
50%,144042.16,16664.63,16491.805,,194842.065,194842.065,2011.0,,
75%,184727.1325,57995.9625,27305.23,,209450.04,209450.04,2011.0,,
max,294580.02,245131.88,400184.25,,567595.43,567595.43,2011.0,,
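
Non-numeric columns can be summarised too, by telling `.describe()` which data types to include or exclude. The sketch below assumes a small hypothetical `DataFrame` (`demo_df`, with made-up `job` and `pay` columns) and uses the `exclude` parameter to summarise only the non-numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one text column and one numeric column
demo_df = pd.DataFrame({
    "job": ["clerk", "nurse", "clerk", "clerk"],
    "pay": [10.0, 20.0, 30.0, 40.0],
})

numeric_summary = demo_df.describe()                  # numeric columns only
text_summary = demo_df.describe(exclude=[np.number])  # non-numeric columns only

print(numeric_summary)
print(text_summary)
```

For text columns the summary reports the count, the number of unique values, the most frequent value (`top`) and how often it occurs (`freq`), instead of the mean and quartiles.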


### The `.value_counts()` Method

>The `.value_counts()` method in Pandas is used to count the frequency of unique values in a column. It's particularly useful for understanding the distribution of categorical data, highlighting the most common and least common values in a `Series`.

In [None]:
salary_df['Year'].value_counts()

2011    676
Name: Year, dtype: int64

From running the code block above, we now know that every entry in the `Year` column is the same value, `2011`. This would have been hard to ascertain from looking at the raw data.
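
The method is more informative on a column with several distinct values. As an illustrative sketch, assuming a hypothetical `Series` of job titles (`jobs` is made up, not taken from the salary dataset), the block below shows the raw counts and, with `normalize=True`, the same information as proportions:

```python
import pandas as pd

# Hypothetical stand-in data: a small categorical column
jobs = pd.Series(["clerk", "nurse", "clerk", "clerk"], name="job")

counts = jobs.value_counts()                # most frequent value first
shares = jobs.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(shares)
```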

## Sorting Data


>Sorting data in Pandas is done using the `sort_values()` and `sort_index()` methods. This organises the data by column values or index, aiding in data analysis and visualisation.

To sort by a single column, call `sort_values()` and pass the column name to the `by` parameter. By default it sorts in ascending order, so to sort in descending order you also need to pass `ascending=False`.

- **Numeric** columns are sorted numerically
- **Text** columns are sorted alphabetically

In [None]:
sorted_salary_df = salary_df.sort_values(by=['BasePay'], ascending=False)
sorted_salary_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
12,AM,EDWARD HARRINGTON,EXECUTIVE CONTRACT EMPLOYEE,294580.02,0.0,0.0,,294580.02,294580.02,2011,,San Francisco,
9,AJ,JOANNE HAYES-WHITE,"CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)",285262.0,0.0,17115.73,,302377.73,302377.73,2011,,San Francisco,
13,AN,JOHN MARTIN,DEPARTMENT HEAD V,271329.03,0.0,21342.59,,292671.62,292671.62,2011,,San Francisco,
16,AQ,AMY HART,DEPARTMENT HEAD V,268604.57,0.0,16115.86,,284720.43,284720.43,2011,,San Francisco,
28,BC,DENISE SCHMITT,DEPUTY CHIEF III (POLICE DEPARTMENT),261717.6,0.0,2357.0,,264074.6,264074.6,2011,,San Francisco,


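You can also sort by multiple columns simultaneously. Rows are sorted by the first column listed, and each subsequent column is used to break ties in the ones before it. Passing a list to `ascending` sets the direction for each column separately: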

In [None]:
sorted_salary_df = salary_df.sort_values(by=['JobTitle', 'BasePay'], ascending=[True, False])
sorted_salary_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
36,BK,SUSAN CURRIN,"ADMINISTRATOR, SFGH MEDICAL CENTER",245124.44,0.0,12000.0,,257124.44,257124.44,2011,,San Francisco,
59,CH,AI-KYUNG CHUNG,ANESTHETIST,214745.44,9161.31,14972.45,,238879.2,238879.2,2011,,San Francisco,
96,DS,SARAH CARY,ANESTHETIST,208925.6,5539.3,12615.6,,227080.5,227080.5,2011,,San Francisco,
110,EG,MARK SMITH,ANESTHETIST,206057.69,3431.01,10921.33,,220410.03,220410.03,2011,,San Francisco,
121,ER,SHELLEY MITCHELL,ANESTHETIST,203658.55,4759.36,9561.44,,217979.35,217979.35,2011,,San Francisco,
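
The tie-breaking behaviour is easier to follow on a handful of rows. The sketch below assumes a hypothetical four-row `DataFrame` (`demo_df`, with made-up `job` and `pay` values) and sorts it alphabetically by job, then by pay from highest to lowest within each job:

```python
import pandas as pd

# Hypothetical stand-in data: two job titles, two rows each
demo_df = pd.DataFrame({
    "job": ["nurse", "clerk", "nurse", "clerk"],
    "pay": [20.0, 10.0, 40.0, 30.0],
})

# 'job' ascending first; 'pay' descending breaks ties within each job
ordered = demo_df.sort_values(by=["job", "pay"], ascending=[True, False])

print(ordered)
```

Note that the original index labels travel with the rows, so the index of the result is no longer in order.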


The `sort_index()` method sorts a `DataFrame` by its index labels rather than by the values in a column. By default, it sorts the row index in ascending order; for a `DataFrame` with a multi-level index, the `level` parameter selects which index level to sort by.

In the following trivial example, we use the original index values to re-sort `sorted_salary_df` back to its original state:

In [None]:
sorted_salary_df = sorted_salary_df.sort_index()
sorted_salary_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
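
To see the round trip on a small scale, here is a minimal sketch assuming a hypothetical three-row `DataFrame` (`demo_df` and its `pay` values are made up). It sorts by value, restores the original order with `sort_index()`, and also sorts the index in descending order:

```python
import pandas as pd

# Hypothetical stand-in data with the default 0, 1, 2 index
demo_df = pd.DataFrame({"pay": [30.0, 10.0, 20.0]})

by_pay = demo_df.sort_values(by="pay")                # index order becomes 1, 2, 0
restored = by_pay.sort_index()                        # back to 0, 1, 2
reversed_index = demo_df.sort_index(ascending=False)  # 2, 1, 0

print(list(by_pay.index), list(restored.index), list(reversed_index.index))
```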


## Filtering Data

>Filtering data in Pandas involves selecting rows based on specific conditions. This can be done using logical indexing, where conditions are set on one or more columns to isolate subsets of data. Effective filtering allows for targeted analysis of a subset of the data.

There may be times when you wish to separate out sections of your data, based on the value taken by a specific column. For example, if you are working with a `DataFrame` of customers for an international business, you might want to separate out just those customers where the `country` column contains the value `Switzerland`. Alternatively, you might want to analyse the behaviour of customers within a certain age range. These are both situations where **logical indexing** is useful.

### Indexing with a Logical Mask

You have already met the `loc` attribute in a previous lesson, in the context of selecting a row based on a value. You can also use it to select a row based on a logical expression. 

Let's take the example of selecting the entries in a `customers` `DataFrame` where the country matches `Switzerland`. We use the `loc` attribute to select rows in which a logical expression is evaluated to `True`. The syntax to assign matches to a new `DataFrame` is as follows:

`new_df = df.loc[<your logical expression>]`

Run the code block below to see this in action:

In [None]:
# Creating a sample dataframe
data = {
    'Name': ['Alansana', 'Briana', 'Chanmony', 'Dietrich', 'Eva'],
    'Age': [23, 34, 45, 36, 50],
    'Country': ['USA', 'Switzerland', 'UK', 'Switzerland', 'Canada']
}
customers = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(customers)

# Using `loc` to select rows based on a logical expression
swiss_customers = customers.loc[customers['Country'] == 'Switzerland']

# Displaying the dataframe with only Swiss customers
print("\nSwiss Customers:")
print(swiss_customers)

Original DataFrame:
       Name  Age      Country
0  Alansana   23          USA
1    Briana   34  Switzerland
2  Chanmony   45           UK
3  Dietrich   36  Switzerland
4       Eva   50       Canada

Swiss Customers:
       Name  Age      Country
1    Briana   34  Switzerland
3  Dietrich   36  Switzerland
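
The same pattern covers the age-range case mentioned earlier. As a sketch under stated assumptions (the `customers_demo` `DataFrame` is a hypothetical copy of the sample data, and the 30 to 40 range is chosen only for illustration), the block below builds the mask with `between()` and also passes a column label to `loc` so that only the names are returned:

```python
import pandas as pd

# Hypothetical stand-in data mirroring the sample customers
customers_demo = pd.DataFrame({
    "Name": ["Alansana", "Briana", "Chanmony", "Dietrich", "Eva"],
    "Age": [23, 34, 45, 36, 50],
})

# between() includes both endpoints by default
in_range = customers_demo["Age"].between(30, 40)

# loc accepts a row mask and a column label together
names_in_range = customers_demo.loc[in_range, "Name"]

print(names_in_range)
```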


### Selecting Based on Multiple Values
If you want to select rows whose value in a column matches any of several values, you can use the `.isin()` method. Called on a column, `.isin()` returns a Boolean `Series` that is `True` for each row whose value is a member of the given list:

In [None]:

# Selecting rows where 'Country' is either 'USA' or 'UK'
selected_countries = ['USA', 'UK']
mask = customers['Country'].isin(selected_countries)

# Using the mask to get the subset of the DataFrame
subset_df = customers[mask]

# Displaying the subset of the DataFrame
print("\nSubset DataFrame (Countries: USA, UK):")
print(subset_df)



Subset DataFrame (Countries: USA, UK):
       Name  Age Country
0  Alansana   23     USA
2  Chanmony   45      UK
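
A Boolean mask can also be inverted with the `~` operator, which is a convenient way to select everything that is not in the list. The sketch below assumes a hypothetical copy of the sample data (`customers_demo`):

```python
import pandas as pd

# Hypothetical stand-in data mirroring the sample customers
customers_demo = pd.DataFrame({
    "Name": ["Alansana", "Briana", "Chanmony", "Dietrich", "Eva"],
    "Country": ["USA", "Switzerland", "UK", "Switzerland", "Canada"],
})

mask = customers_demo["Country"].isin(["USA", "UK"])

# ~ flips every True to False and vice versa
other_customers = customers_demo[~mask]

print(other_customers)
```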


### Advanced Logical Filtering

It is possible to select rows based on arbitrarily complex logical masks, provided the output of the logical expression is a `Series` of Boolean values. 

Let's look at a more complex example. We will select all employees whose base pay is greater than `$50,000`, and who work in a police department.

We can use the `str.contains()` method to check whether each row of the `JobTitle` column contains the word `police` (passing `case=False` so that the match ignores case), use the `>` comparison operator to handle the salary thresholding, and combine the two conditions with the `&` operator:

In [None]:
mask = (salary_df['JobTitle'].str.contains('police', case=False)) & (salary_df['BasePay'] > 50000)
print(f'The datatype of the mask is {type(mask)}')
mask.head(5)



The datatype of the mask is <class 'pandas.core.series.Series'>


0    False
1     True
2     True
3    False
4    False
dtype: bool

In [None]:
filtered_df = salary_df[mask] # Applying the mask to the DataFrame
filtered_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
11,AL,PATRICIA JACKSON,CAPTAIN III (POLICE DEPARTMENT),99722.0,87082.62,110804.3,,297608.92,297608.92,2011,,San Francisco,
15,AP,RICHARD CORRIEA,"COMMANDER III, (POLICE DEPARTMENT)",198778.01,73478.2,13957.65,,286213.86,286213.86,2011,,San Francisco,
25,AZ,GREGORY SUHR,CHIEF OF POLICE,256470.41,0.0,11522.18,,267992.59,267992.59,2011,,San Francisco,
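
Masks can be combined with `|` (or) and `~` (not) as well as `&` (and). The sketch below assumes a small hypothetical `DataFrame` (`staff`, with made-up job titles and pay, and an arbitrary threshold of 70,000) and names each condition before combining them, which avoids the operator-precedence errors that arise when comparisons are written inline without parentheses:

```python
import pandas as pd

# Hypothetical stand-in data, not the salary dataset
staff = pd.DataFrame({
    "JobTitle": ["POLICE OFFICER", "FIREFIGHTER", "CHIEF OF POLICE", "CLERK"],
    "BasePay": [60000.0, 80000.0, 250000.0, 40000.0],
})

is_police = staff["JobTitle"].str.contains("police", case=False)
high_pay = staff["BasePay"] > 70000

either = staff[is_police | high_pay]             # police OR high pay
police_not_high = staff[is_police & ~high_pay]   # police AND NOT high pay

print(either)
print(police_not_high)
```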


## Key Takeaways

- Pandas `DataFrame`s provide tools for exploring and summarising data, crucial for initial steps in Exploratory Data Analysis
- Load datasets in Pandas using the `pd.read_csv()` function
- Use `.head()` and `.tail()` methods to view the first and last rows of a `DataFrame` in Pandas
- Use the `.shape` attribute to determine the number of rows and columns in a `DataFrame`
- Use Pandas' `.info()`, `.describe()`, and `.value_counts()` to get descriptive information about `DataFrame` columns
- The `.info()` method in Pandas provides the row count, memory usage, and the data type and number of non-null values for each column
- The `.describe()` method in Pandas provides a statistical summary of numerical columns in a `DataFrame`
- Use `.value_counts()` in Pandas to count frequency of unique values in a column
- Use `sort_values()` to sort Pandas data by column values and `sort_index()` to sort by index; specify `ascending=False` for descending order
- Filtering in Pandas uses logical indexing to select data based on specific conditions for targeted analysis
- Use the `loc` attribute to select rows in a `DataFrame` based on a logical expression
- Use the `.isin()` method to select multiple values from a column in a Pandas `DataFrame`
- Use logical masks with Pandas to filter data based on complex conditions