# Feature Engineering with Pandas

This notebook covers feature engineering with Pandas: the main types of data, selecting columns by data type, encoding categorical variables, grouping uncommon categories, binarizing and binning numeric variables, and checking for missing data.

We will use real datasets:
- Stack Overflow Developer Survey 2023: https://raw.githubusercontent.com/Stephen137/stack_overflow_developer_survey_2023/main/data/survey_results_public_2023.csv
- NYC Restaurant Inspection Results: https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD

Note: The Stack Overflow survey data includes columns such as 'Country' and 'ConvertedCompYearly' (similar to ConvertedSalary).

## Types of Data

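- Continuous: numeric values measured on a scale (e.g., salary)
- Categorical: unordered labels (e.g., gender, birth country)
- Ordinal: ordered categories with no fixed distance between levels
- Boolean: True/False values
- Date time: dates and timestamps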
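
As a rough illustration of how these types map onto Pandas dtypes, here is a minimal sketch. The `toy` DataFrame and all of its column names and values are made up for this example and are not part of either dataset.

```python
import pandas as pd

# Hypothetical toy data: one column per data type listed above
toy = pd.DataFrame({
    "salary": [52000.0, 61000.5, 48000.0],                 # continuous
    "country": ["USA", "India", "Nepal"],                  # categorical
    "education": ["Bachelor", "Master", "Bachelor"],       # ordinal (ordered below)
    "is_employed": [True, False, True],                    # boolean
    "joined": ["2021-01-15", "2022-06-01", "2023-03-20"],  # date time (parsed below)
})

# Unordered categorical
toy["country"] = toy["country"].astype("category")

# Ordered categorical: the levels have a rank but no numeric distance
toy["education"] = pd.Categorical(
    toy["education"], categories=["Bachelor", "Master", "PhD"], ordered=True
)

# Parse ISO date strings into datetime64 values
toy["joined"] = pd.to_datetime(toy["joined"])

print(toy.dtypes)

# Ordered categories support comparisons against a level
print((toy["education"] >= "Master").tolist())
```

Declaring the ordinal column as an ordered categorical keeps the ranking explicit without pretending the levels are evenly spaced numbers.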

## Exercise 1: Loading Data and Checking Types

In [3]:
import pandas as pd

# Path to the Stack Overflow survey schema file
# (it describes the survey questions, not the individual responses)
so_survey_csv = "survey_results_schema.csv"
# Load the data
so_survey_df = pd.read_csv(so_survey_csv)

# Print the first five rows
print(so_survey_df.head())

# Print the data types
print('\nColumn Data Types:')
print(so_survey_df.dtypes)

      qid       qname                                           question  \
0    QID2  MainBranch  Which of the following options best describes ...   
1  QID127         Age                                 What is your age?*   
2  QID296  Employment  Which of the following best describes your cur...   
3  QID308  RemoteWork  Which best describes your current work situation?   
4  QID341       Check  Just checking to make sure you are paying atte...   

  force_resp type selector  
0       True   MC     SAVR  
1       True   MC     SAVR  
2       True   MC     MAVR  
3      False   MC     SAVR  
4       True   MC     SAVR  

Column Data Types:
qid           object
qname         object
question      object
force_resp    object
type          object
selector      object
dtype: object


## Selecting Specific Data Types

In [4]:
# Create subset of only the numeric columns
# ('number' matches every integer and float width, regardless of platform)
so_numeric_df = so_survey_df.select_dtypes(include='number')

# Print the column names
print(so_numeric_df.columns)

Index([], dtype='object')
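
The result above is empty because the schema file contains no numeric columns. To show the selection doing something, here is a small sketch on a made-up DataFrame; `mixed_df` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame mixing text, integer, float, and boolean columns
mixed_df = pd.DataFrame({
    "respondent": ["a", "b", "c"],
    "years_coding": [3, 10, 7],
    "salary": [52000.0, 61000.5, 48000.0],
    "remote": [True, False, True],
})

# Keep only integer and float columns (booleans are not counted as numbers)
numeric_df = mixed_df.select_dtypes(include="number")
print(numeric_df.columns.tolist())      # ['years_coding', 'salary']

# The complement: everything that is not numeric
non_numeric_df = mixed_df.select_dtypes(exclude="number")
print(non_numeric_df.columns.tolist())  # ['respondent', 'remote']
```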


## Dealing with Categorical Variables

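Most machine learning models require numeric input, so categorical variables are encoded as numbers or booleans.

Types of encoding:
1. One-Hot Encoding: turns n categories into n binary features, one per category.
2. Dummy Encoding: turns n categories into n-1 binary features, omitting one category as the baseline to avoid collinearity.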
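
The collinearity point can be seen directly: in a one-hot encoding, every row sums to 1, so any one column is fully determined by the others. A minimal sketch on a made-up `colors` Series (hypothetical data):

```python
import pandas as pd

# Hypothetical categorical column with three categories
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot: one 0/1 column per category
one_hot = pd.get_dummies(colors, dtype=int)

# Dummy: the first category (alphabetically, 'blue') is dropped as the baseline
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)

# Every one-hot row sums to 1, which makes one column redundant
print(one_hot.sum(axis=1).tolist())       # [1, 1, 1, 1]
print(one_hot.shape[1], dummy.shape[1])   # 3 2
```

In the dummy encoding, a row of all zeros represents the dropped baseline category.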

## One-Hot Encoding Example

In [8]:
import pandas as pd

# Create a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack',
             'Kira', 'Liam', 'Mona', 'Nina', 'Oscar', 'Paul', 'Quinn', 'Rita', 'Sam', 'Tina'],
    'Country': ['USA', 'India', 'USA', 'Germany', 'India', 'Nepal', 'Germany', 'USA', 'Nepal', 'India',
                'Germany', 'USA', 'Nepal', 'India', 'USA', 'Germany', 'Nepal', 'India', 'USA', 'Nepal']
}

df = pd.DataFrame(data)

# Display the dataset
# print(df)


# Convert the Country column to one-hot encoded DataFrame
one_hot_encoded = pd.get_dummies(df, columns=['Country'], prefix='OH')

# Print the column names
print(one_hot_encoded.columns)

Index(['Name', 'OH_Germany', 'OH_India', 'OH_Nepal', 'OH_USA'], dtype='object')


## Dummy Encoding Example

In [10]:
# Create dummy variables for the Country column
# drop_first=True drops the first category (here 'Germany'), which becomes the baseline
dummy = pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='DM')

# Print the column names
print(dummy.columns)

Index(['Name', 'DM_India', 'DM_Nepal', 'DM_USA'], dtype='object')


## Dealing with Uncommon Categories

In [13]:
# Create a series out of the Country column
countries = df['Country']

# Get the counts of each category
country_counts = countries.value_counts()

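# Create a mask for rows whose category occurs fewer than 10 times
# Note: the sample has only 20 rows, so every country falls below this
# threshold and all rows end up relabeled
mask = countries.isin(country_counts[country_counts < 10].index)

# Relabel the uncommon categories as 'Other'
df.loc[mask, 'Country'] = 'Other'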

# Print the updated category counts
print(df['Country'].value_counts())

Country
Other    20
Name: count, dtype: int64
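
Because every country in the 20-row sample occurs fewer than 10 times, the threshold above collapses everything into 'Other'. The sketch below uses a lower, assumed threshold of 5 on a made-up Series with the same country counts as the sample, so that only the rarest category is grouped. The name `sample` and the threshold value are illustrative choices.

```python
import pandas as pd

# Hypothetical Series with the same country counts as the sample dataset
sample = pd.Series(
    ["USA"] * 6 + ["India"] * 5 + ["Nepal"] * 5 + ["Germany"] * 4,
    name="Country",
)

# Count each category and pick those below the (assumed) threshold of 5
counts = sample.value_counts()
rare = counts[counts < 5].index

# Keep common categories, replace rare ones with 'Other'
grouped = sample.where(~sample.isin(rare), "Other")

print(grouped.value_counts())
```

With real data the threshold is a judgment call: it should be low enough to keep informative categories and high enough that the remaining ones have enough rows to learn from.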


## Numeric Variables

Binarizing turns a numeric column into a 0/1 flag. Example with the restaurant data: flagging inspections that have a non-zero violation score.

In [None]:
# Load NYC Restaurant Inspection Data
restaurant_csv = 'https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD'
restaurant_df = pd.read_csv(restaurant_csv)

# Print head
print(restaurant_df.head())

# Assumption: 'SCORE' is the inspection's violation score (higher score = more violations)
# Binary flag: 1 if SCORE > 0, else 0 (rows with a missing SCORE also get 0)
restaurant_df['has_violation'] = (restaurant_df['SCORE'] > 0).astype(int)

# Print sample
print(restaurant_df[['SCORE', 'has_violation']].head())
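
One thing to watch when binarizing is missing values: a comparison with NaN is False, so missing scores silently become 0. The sketch below, on a made-up `scores` DataFrame (hypothetical values), shows this default and one possible way to keep the gaps visible instead.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including one missing value
scores = pd.DataFrame({"SCORE": [0, 12, np.nan, 27, 0]})

# Default behavior: NaN > 0 is False, so the missing score becomes 0
scores["has_violation"] = (scores["SCORE"] > 0).astype(int)

# Alternative: keep missing scores as NaN in the flag column
scores["has_violation_keep_nan"] = (
    (scores["SCORE"] > 0).astype(float).where(scores["SCORE"].notna())
)

print(scores)
```

Which behavior is appropriate depends on whether a missing score should be read as "no violation" or as "unknown".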

## Binning Numeric Data

In [None]:
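# Back to the Stack Overflow data. Exercise 1 loaded the schema file, which has no
# 'ConvertedCompYearly' column, so load the survey responses themselves here
so_survey_df = pd.read_csv('https://raw.githubusercontent.com/Stephen137/stack_overflow_developer_survey_2023/main/data/survey_results_public_2023.csv')

# First binarize: Paid_Job is 1 where ConvertedCompYearly > 0, else 0
# (respondents with a missing ConvertedCompYearly also get 0)
so_survey_df['Paid_Job'] = (so_survey_df['ConvertedCompYearly'] > 0).astype(int)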

# Print sample
print(so_survey_df[['Paid_Job', 'ConvertedCompYearly']].head())

In [None]:
import numpy as np

# Specify bin boundaries
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the ConvertedCompYearly
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedCompYearly'], bins=bins, labels=labels)

# Print sample
print(so_survey_df[['boundary_binned', 'ConvertedCompYearly']].head())
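
To make the bin boundaries concrete, here is a small sketch that applies the same bins and labels to a handful of made-up salaries (the `salaries` values are hypothetical). By default `pd.cut` treats each bin as closed on the right, so a value equal to a boundary falls into the lower bin; passing `right=False` flips this.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries chosen to sit on and around the bin boundaries
salaries = pd.Series([5000, 10000, 10001, 75000, 150000, 200000])

bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Default: intervals are (low, high], so 10000 is 'Very low' and 150000 is 'High'
binned = pd.cut(salaries, bins=bins, labels=labels)
print(binned.astype(str).tolist())

# right=False: intervals are [low, high), so 10000 is 'Low' and 150000 is 'Very high'
binned_left = pd.cut(salaries, bins=bins, labels=labels, right=False)
print(binned_left.astype(str).tolist())
```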

## Handling Gaps in Data (Missing Values)

In [None]:
# Check info
so_survey_df.info()

# Check missing values
print(so_survey_df.isnull().sum())

In [None]:
# For restaurant data
restaurant_df.info()

print(restaurant_df.isnull().sum())
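
Once the gaps are counted, two common ways to handle them are dropping incomplete rows and filling in replacement values. The sketch below shows both on a made-up `gaps` DataFrame; the data, the 'Unknown' label, and the choice of the median as a fill value are illustrative assumptions, not recommendations for either dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in both columns
gaps = pd.DataFrame({
    "Country": ["USA", None, "India", "Nepal"],
    "ConvertedCompYearly": [52000.0, np.nan, 61000.0, np.nan],
})

# Count missing values per column
print(gaps.isnull().sum())

# Option 1: drop every row that has at least one missing value
complete_rows = gaps.dropna()

# Option 2: fill the gaps instead of dropping rows
filled = gaps.copy()
filled["Country"] = filled["Country"].fillna("Unknown")
median_comp = filled["ConvertedCompYearly"].median()  # median ignores NaN
filled["ConvertedCompYearly"] = filled["ConvertedCompYearly"].fillna(median_comp)

print(complete_rows)
print(filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of introducing values that were never observed.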