## Data Wrangling: Filtering and Subsetting

**Data wrangling** is a critical step in the data analysis process, involving the transformation of raw data into a suitable format for analysis. One of the essential operations within data wrangling is **filtering and subsetting**.

### Filtering and Subsetting: A Closer Look

Filtering and subsetting enable analysts to focus on specific portions of a dataset that are relevant to their research questions. By isolating data based on specific criteria, analysts can:

* **Reduce noise:** Remove irrelevant or redundant data points.
* **Improve accuracy:** Focus on data that aligns with the research objectives.
* **Enhance efficiency:** Streamline analysis by working with smaller, more manageable datasets.

### Common Filtering and Subsetting Techniques

* **Boolean Indexing:**
  - Create logical expressions to filter data based on conditions.
  - For example, filter rows where a specific column value is greater than a threshold.
* **Slicing:**
  - Extract specific rows or columns using indexing.
  - Useful for selecting subsets based on their position within the dataset.
* **Masking:**
  - Create a boolean mask that indicates which elements to keep or discard.
  - Often used in combination with other filtering techniques.
* **Groupby and Aggregation:**
  - Group data by specific criteria and apply aggregation functions (e.g., mean, sum, count) to summarize the filtered subsets.

### Real-World Applications

Filtering and subsetting are essential in various data analysis domains:

* **Customer Segmentation:** Identify target customer groups based on demographics, purchase history, or other relevant factors.
* **Financial Analysis:** Analyze stock market data by filtering for specific time periods, industries, or company sizes.
* **Scientific Research:** Isolate experimental data based on control groups, treatment variables, or measurement criteria.
* **Data Visualization:** Create more informative visualizations by focusing on specific subsets of data.

By mastering filtering and subsetting techniques, analysts can effectively transform raw data into valuable insights that drive informed decision-making.


## Filtering and Subsetting Techniques: Syntax

### 1. Boolean Indexing

* **Syntax:**
  ```python
  filtered_data = data[condition]
  ```
  - `data`: The original DataFrame.
  - `condition`: A Boolean Series (or array) with one True/False value per row; only the rows marked True are kept.

* **Example:**
  ```python
  import pandas as pd

  data = pd.DataFrame({
      'Column1': [1, 2, 3, 4, 5],
      'Column2': ['A', 'B', 'A', 'C', 'B']
  })

  # Filter rows where 'Column1' is greater than 3
  filtered_data = data[data['Column1'] > 3]
  ```
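Several conditions can be combined in one filter. The sketch below reuses the toy DataFrame from the example above; the variable names `both`, `either`, and `not_a` are illustrative only. Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Same toy DataFrame as in the example above
data = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5],
    'Column2': ['A', 'B', 'A', 'C', 'B']
})

# AND: both conditions must hold
both = data[(data['Column1'] > 1) & (data['Column2'] == 'B')]

# OR: at least one condition must hold
either = data[(data['Column1'] > 4) | (data['Column2'] == 'C')]

# NOT: invert a condition with ~
not_a = data[~(data['Column2'] == 'A')]

print(both)
print(either)
print(not_a)
```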

### 2. Slicing

* **Syntax:**
  ```python
  subset_data = data.iloc[start:stop:step]  # For integer-based indexing
  subset_data = data.loc[start:stop:step]  # For label-based indexing
  ```
  - `start`: The starting position (`iloc`) or label (`loc`).
  - `stop`: The ending position or label. With `iloc` the stop position is excluded; with `loc` the stop label is included.
  - `step`: The step size (optional).

* **Example:**
  ```python
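  # Extract the rows at positions 2 to 4 (stop position 5 is excluded)
  subset_data = data.iloc[2:5]

  # Extract rows with labels 'A' to 'C' (inclusive). Label slicing needs the
  # labels in a sorted index, so use 'Column2' as the index first.
  subset_data = data.set_index('Column2').sort_index().loc['A':'C']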
  ```
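The difference between position-based and label-based slicing is easiest to see on a DataFrame whose index is not the default 0, 1, 2, ... The following is a minimal sketch; the string labels `'a'` to `'e'` are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {'Column1': [1, 2, 3, 4, 5]},
    index=['a', 'b', 'c', 'd', 'e']  # hypothetical string labels
)

# iloc: positions 1 and 2; the stop position 3 is excluded
by_position = data.iloc[1:3]

# loc: labels 'b', 'c', and 'd'; the stop label is included
by_label = data.loc['b':'d']

# A step of 2 keeps every other row: positions 0, 2, and 4
every_other = data.iloc[::2]

print(by_position)
print(by_label)
print(every_other)
```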

### 3. Masking

* **Syntax:**
  ```python
  mask = condition
  subset_data = data[mask]
  ```
  - `condition`: A Boolean expression that evaluates to True or False for each row.
  - `mask`: A Boolean Series that indicates which rows to keep.

* **Example:**
  ```python
  # Create a mask for rows where 'Column2' is 'A'
  mask = data['Column2'] == 'A'

  # Extract rows based on the mask
  subset_data = data[mask]
  ```
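Because a mask is an ordinary Boolean Series, it can be inspected and reused before it is applied. The sketch below, on the same toy DataFrame, counts the matching rows and then selects rows and a column in one step with `.loc`; the names `n_matches` and `col1_for_a` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5],
    'Column2': ['A', 'B', 'A', 'C', 'B']
})

# Boolean Series: True where 'Column2' is 'A'
mask = data['Column2'] == 'A'

# True counts as 1, so the sum is the number of matching rows
n_matches = int(mask.sum())

# Select the matching rows and a single column at the same time
col1_for_a = data.loc[mask, 'Column1']

print(n_matches)
print(col1_for_a)
```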

### 4. Groupby and Aggregation

* **Syntax:**
  ```python
  grouped_data = data.groupby('column_name')
  aggregated_data = grouped_data.agg('function_name')
  ```
  - `column_name`: The column to group by.
  - `function_name`: The aggregation function to apply (e.g., 'mean', 'sum', 'count').

* **Example:**
  ```python
  # Group by 'Column2' and calculate the mean of 'Column1'
  grouped_data = data.groupby('Column2')
  aggregated_data = grouped_data['Column1'].mean()
  ```
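`agg` also accepts a list of function names, which produces one column per function. A small sketch on the same toy DataFrame (the name `summary` is illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'Column1': [1, 2, 3, 4, 5],
    'Column2': ['A', 'B', 'A', 'C', 'B']
})

# One row per group in 'Column2', one column per aggregation function
summary = data.groupby('Column2')['Column1'].agg(['mean', 'sum', 'count'])

print(summary)
```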

By combining these techniques, you can effectively filter and subset your data to extract the information you need for your analysis.


Read more:
https://www.fintechfutures.com/files/2017/10/Trifacta_Principles-of-Data-Wrangling.pdf

We will carry out a simple data wrangling task using the Gapminder dataset (gapminder.csv).

The metadata for the dataset is available here:
https://zief0002.github.io/miniature-garbanzo/codebooks/gapminder.html

In [2]:
# import package
import pandas as pd

In [3]:
# Load the data
df = pd.read_csv(r'/content/gapminder_with_codes.csv')  # replace the path with the location of your data file

In [4]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4


In [None]:
# Check columns
df.columns

In [5]:
# Unique values in 'country'
df['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Leba

We use unique() to check the unique values in a column. This gives us an idea of the possible subsets of our data that we can create.

In [6]:
df.continent.unique()

array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)

In [14]:
df.year.unique()

array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])
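Beyond listing the unique values, it can help to know how many rows each candidate subset would contain. The sketch below uses `nunique()` and `value_counts()` on a five-row stand-in built from the 2007 rows shown in this notebook's output; on the real data the same calls would be made on `df`. The name `sample` is an assumption for illustration.

```python
import pandas as pd

# Five 2007 rows copied from the notebook output, used as a stand-in for df
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina'],
    'continent': ['Asia', 'Europe', 'Africa', 'Africa', 'Americas'],
    'year': [2007, 2007, 2007, 2007, 2007],
})

# Number of distinct continents in the stand-in
n_continents = sample['continent'].nunique()

# Number of rows per continent, i.e. the size of each possible subset
rows_per_continent = sample['continent'].value_counts()

print(n_continents)
print(rows_per_continent)
```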

In [7]:
# Boolean mask for the year 2007 (True where the condition holds)
df['year'] == 2007

Unnamed: 0,year
0,False
1,False
2,False
3,False
4,False
...,...
1699,False
1700,False
1701,False
1702,False


In [11]:
# Subset for the year 2007: keep only the rows where the condition is True
subset_temporal = df[df['year'] == 2007]
subset_temporal.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
11,Afghanistan,Asia,2007,43.828,31889923,974.580338,AFG,4
23,Albania,Europe,2007,76.423,3600523,5937.029526,ALB,8
35,Algeria,Africa,2007,72.301,33333216,6223.367465,DZA,12
47,Angola,Africa,2007,42.731,12420476,4797.231267,AGO,24
59,Argentina,Americas,2007,75.32,40301927,12779.37964,ARG,32



In [None]:
subset_temporal

In [None]:
# Subset for the continent 'Asia'
subset_geographical = df[df['continent'] == 'Asia']

In [None]:
subset_geographical

In [None]:
# Subset for India in 2007
india07 = df[(df['country'] == 'India') & (df['year'] == 2007)]

In [None]:
india07

In [None]:
# Filter rows where life expectancy is above 80
df[df['lifeExp'] > 80]

In [None]:
# Filter rows where life expectancy is below 30
df[df['lifeExp'] < 30]
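Both tails can also be selected in a single step, and `Series.between` gives the complementary middle range. This is a sketch on a five-row stand-in built from the 2007 rows shown in this notebook; the thresholds 43 and 76 are hypothetical and chosen only so that the small sample returns rows.

```python
import pandas as pd

# Stand-in for df: 2007 life expectancy values copied from the notebook output
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina'],
    'lifeExp': [43.828, 76.423, 72.301, 42.731, 75.320],
})

# Both extremes at once: above the upper OR below the lower threshold
extremes = sample[(sample['lifeExp'] > 76) | (sample['lifeExp'] < 43)]

# The middle range; between() includes both endpoints by default
middle = sample[sample['lifeExp'].between(43, 76)]

print(extremes)
print(middle)
```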

In [None]:
# Select the rows for specific countries so they can be compared
df[df['country'].isin(['United States', 'China', 'India'])]

In [None]:
# Additional filters for Africa and GDP per capita < 5000
africa_less_5000 = df[(df['continent'] == 'Africa') & (df['gdpPercap'] < 5000)]

In [None]:
# Narrow the Africa subset further to the year 2007
africa_less_5000[africa_less_5000['year'] == 2007]

In [None]:
# OR instead of AND: rows that are in Africa or have GDP per capita < 5000
df[(df['continent'] == 'Africa') | (df['gdpPercap'] < 5000)]
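Switching `&` to `|` changes which rows come back, which is easy to check on a small sample. The sketch below uses five 2007 rows copied from this notebook's output as a stand-in for `df`, and also shows `DataFrame.query` as one possible string-based way to write the AND filter.

```python
import pandas as pd

# Stand-in for df: 2007 rows copied from the notebook output
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina'],
    'continent': ['Asia', 'Europe', 'Africa', 'Africa', 'Americas'],
    'gdpPercap': [974.580338, 5937.029526, 6223.367465, 4797.231267, 12779.379640],
})

# AND: African countries that also have GDP per capita below 5000
and_rows = sample[(sample['continent'] == 'Africa') & (sample['gdpPercap'] < 5000)]

# OR: any African country, plus any country with GDP per capita below 5000
or_rows = sample[(sample['continent'] == 'Africa') | (sample['gdpPercap'] < 5000)]

# The AND filter written as a query string
and_query = sample.query("continent == 'Africa' and gdpPercap < 5000")

print(and_rows)
print(or_rows)
```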

### Use Case: Analyzing Population Growth in a Specific Decade

In this use case, we subset and filter the Gapminder dataset to analyze population over a specific decade for a chosen continent. The goal is to compare countries within that period, starting with the country that has the largest population.

In [15]:
unique_continents = df.continent.unique()

In [16]:
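# Continent to analyze
selected_continent = 'Africa'

# Keep only the rows for the selected continent
continent_data = df[df['continent'].isin([selected_continent])]

# First and last year of the period (the data has one observation every five years)
decade_start = 1997
decade_end = 2007

# Keep the rows whose year falls within the period (both ends included)
decade_data = continent_data[(continent_data['year'] >= decade_start) & (continent_data['year'] <= decade_end)]

# Sum the recorded populations per country over the observations in the period
total_pop_per_country = decade_data.groupby('country')['pop'].sum().reset_index()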



In [17]:
total_pop_per_country

Unnamed: 0,country,pop
0,Algeria,93692373
1,Angola,33161606
2,Benin,21170507
3,Botswana,4806014
4,Burkina Faso,36930255
5,Burundi,21533193
6,Cameroon,47822090
7,Central African Republic,12113564
8,Chad,26636557
9,Comoros,1853324


In [18]:
# Row of the country with the largest summed population
max_pop_country = total_pop_per_country.loc[total_pop_per_country['pop'].idxmax()]

In [19]:
# display results

print(f"\nPopulation Analysis for {selected_continent} in the {decade_start}-{decade_end} Decade:")
print(f"\nCountry with the Highest Population: {max_pop_country['country']}")
print(f"\nTotal Population in {decade_start} - {decade_end}: {max_pop_country['pop']}")


Population Analysis for Africa in the 1997-2007 Decade:

Country with the Highest Population: Nigeria

Total Population in 1997 - 2007: 361140277
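Note that the dataset has one observation every five years, so the sum above adds the 1997, 2002, and 2007 snapshots together rather than measuring growth. One possible way to look at growth itself is to compare the first and last year directly. The sketch below uses made-up numbers for two hypothetical countries, `X` and `Y`; none of the values come from the Gapminder data.

```python
import pandas as pd

# Hypothetical toy data (not Gapminder values): two countries, three snapshots
toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'year': [1997, 2002, 2007, 1997, 2002, 2007],
    'pop': [100, 110, 120, 50, 70, 90],
})

# Summing the snapshots, as in the cell above
summed = toy.groupby('country')['pop'].sum()

# One row per country, one column per year
wide = toy.pivot(index='country', columns='year', values='pop')

# Absolute and percentage change between the first and last year
wide['abs_change'] = wide[2007] - wide[1997]
wide['pct_change'] = wide['abs_change'] * 100 / wide[1997]

largest_sum = summed.idxmax()
fastest_growth = wide['pct_change'].idxmax()

print(wide)
print(largest_sum, fastest_growth)
```

In this toy example the country with the largest summed population is not the one that grew fastest, which is why the two questions are worth keeping separate.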


Assignment:
1. For each cell where subsetting was applied, use a different technique to perform the same operation.

2. Perform an analysis to find the country with the lowest population between 1997 and 2007, and report its population within that period.