# Data Transformation
---

## 1. Introduction

This notebook demonstrates the **Transform (T)** phase of the ETL process.  
The goal of this phase is to **clean, standardize, enrich, and prepare** both the *raw* and *incremental* datasets for analytical use in later stages.

The transformation steps applied here include:

1. Handling missing values  
2. Removing duplicates  
3. Standardizing date and text formats  
4. Enriching the data with derived features  
5. Encoding the sentiment label as binary indicator flags  

Each transformation is demonstrated with **before-and-after** outputs and brief discussions to show its effect on the dataset.  
The final, cleaned datasets are saved in the `transformed/` directory for subsequent use.

In [1]:
# 1. Library Imports and Data Loading

import pandas as pd
import numpy as np
import os

# Define data directory
data_dir = "data"

# Define file paths
raw_path = os.path.join(data_dir, "raw_data.csv")
incremental_path = os.path.join(data_dir, "incremental_data.csv")

# Load validated datasets
df_raw = pd.read_csv(raw_path)
df_incremental = pd.read_csv(incremental_path)

# Preview datasets for confirmation
print("Raw dataset shape:", df_raw.shape)
print("Incremental dataset shape:", df_incremental.shape)

display(df_raw.head(2))
display(df_incremental.head(2))

Raw dataset shape: (14640, 15)
Incremental dataset shape: (2000, 15)


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52-08:00,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59-08:00,,Pacific Time (US & Canada)


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570310600460525568,negative,0.6292,Flight Booking Problems,0.3146,US Airways,,jhazelnut,,0,@USAirways is there a better time to call? My...,,2015-02-24 11:53:37-08:00,,
1,570310144459972608,negative,1.0,Customer Service Issue,1.0,US Airways,,GAKotsch,,0,@USAirways and when will one of these agents b...,,2015-02-24 11:51:48-08:00,,Atlantic Time (Canada)


## 2. Data Transformations

In this section, I perform a series of transformations to ensure both datasets are clean, standardized, and ready for analysis.  

---

### Transformation 1: Handling Missing Values

The dataset contains missing entries in several columns — notably `negativereason` and `tweet_location`.  

To maintain record completeness while preserving interpretability:

- Missing values in `negativereason` are replaced with `"Unknown"`, since this field records the reason behind a negative sentiment and can reasonably default to `"Unknown"` when unspecified.
- Missing values in `tweet_location` and `user_timezone` are left as-is because imputing them could introduce false geographic information.


In [2]:
# --- BEFORE ---
print("Missing values before transformation:")
print(df_raw.isnull().sum()[df_raw.isnull().sum() > 0])

# Apply transformation
cols_to_fill = ['negativereason']
for col in cols_to_fill:
    df_raw[col] = df_raw[col].fillna('Unknown')
    df_incremental[col] = df_incremental[col].fillna('Unknown')

# --- AFTER ---
print("\nMissing values after transformation:")
print(df_raw.isnull().sum()[df_raw.isnull().sum() > 0])

Missing values before transformation:
negativereason                5462
negativereason_confidence     4118
airline_sentiment_gold       14600
negativereason_gold          14608
tweet_coord                  13621
tweet_location                4733
user_timezone                 4820
dtype: int64

Missing values after transformation:
negativereason_confidence     4118
airline_sentiment_gold       14600
negativereason_gold          14608
tweet_coord                  13621
tweet_location                4733
user_timezone                 4820
dtype: int64


**Discussion:**  
After the transformation, no missing values remain in the `negativereason` column of the raw dataset; each one was replaced with `"Unknown"`.  
The other columns with missing values, including `tweet_location` and `user_timezone`, were deliberately left unchanged to avoid misleading imputation.
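
The same fill logic can be sketched on a small toy DataFrame. This is only an illustration: the helper name `fill_missing_with_label` and the toy values are my own assumptions, not part of the pipeline above. It works on a copy, so the input frame is left untouched.

```python
import numpy as np
import pandas as pd


def fill_missing_with_label(df, columns, label="Unknown"):
    """Hypothetical helper: return a copy of df with NaN in `columns` replaced by `label`."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(label)
    return out


# Toy data (made up for illustration), mirroring two columns of the tweet dataset
toy = pd.DataFrame({
    "negativereason": ["Late Flight", np.nan, np.nan],
    "tweet_location": [np.nan, "Boston", np.nan],
})

filled = fill_missing_with_label(toy, ["negativereason"])

# Only columns that still contain missing values are listed
missing_after = filled.isnull().sum()
print(missing_after[missing_after > 0])
```

Only `tweet_location` should still appear in the printed summary, since it was intentionally excluded from the fill.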

---
### Transformation 2: Removing Duplicates

Duplicate records can distort summary statistics, bias analyses, and inflate record counts.  
Since each tweet in the dataset is uniquely identified by its `tweet_id`, we use this field to detect and remove duplicates.  

This ensures that every observation in the dataset represents a distinct tweet event.

In [3]:
# --- BEFORE ---
print("Duplicate records before transformation:")
print(df_raw.duplicated(subset='tweet_id').sum(), "duplicates in raw dataset")
print(df_incremental.duplicated(subset='tweet_id').sum(), "duplicates in incremental dataset")

# Apply transformation
df_raw = df_raw.drop_duplicates(subset='tweet_id').reset_index(drop=True)
df_incremental = df_incremental.drop_duplicates(subset='tweet_id').reset_index(drop=True)

# --- AFTER ---
print("\nDuplicate records after transformation:")
print(df_raw.duplicated(subset='tweet_id').sum(), "duplicates in raw dataset")
print(df_incremental.duplicated(subset='tweet_id').sum(), "duplicates in incremental dataset")

print("\nUpdated dataset shapes:")
print("Raw:", df_raw.shape)
print("Incremental:", df_incremental.shape)

Duplicate records before transformation:
155 duplicates in raw dataset
147 duplicates in incremental dataset

Duplicate records after transformation:
0 duplicates in raw dataset
0 duplicates in incremental dataset

Updated dataset shapes:
Raw: (14485, 15)
Incremental: (1853, 15)


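**Discussion:**  
After removing duplicates based on `tweet_id`, the raw dataset drops from 14,640 to 14,485 rows and the incremental dataset from 2,000 to 1,853 rows, so each record now represents exactly one tweet.  
This protects data integrity and prevents inflated counts in later stages.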
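
As a minimal sketch of how key-based deduplication behaves, the toy example below keeps the first occurrence of each `tweet_id` and reports how many rows were dropped. The helper name `dedupe_by_key` and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def dedupe_by_key(df, key="tweet_id"):
    """Hypothetical helper: drop repeated keys, keeping the first occurrence.

    Returns the deduplicated frame and the number of rows removed.
    """
    n_dupes = int(df.duplicated(subset=key).sum())
    deduped = df.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
    return deduped, n_dupes


# Toy data (made up): ids 101 and 102 each appear twice
toy = pd.DataFrame({
    "tweet_id": [101, 102, 101, 103, 102],
    "text": ["a", "b", "a (repeat)", "c", "b (repeat)"],
})

deduped, n_removed = dedupe_by_key(toy)
print(n_removed, "rows removed;", len(deduped), "rows kept")
```

Because `keep="first"` is order-dependent, sorting the frame before deduplicating would change which copy of a repeated tweet survives.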

---
### Transformation 3: Standardizing Datetime Format

The `tweet_created` column is currently stored as a string, which limits its usability for time-based analysis.  
To standardize this field:

- The column is converted to `datetime` format using `pd.to_datetime()`.  
- A new column, `tweet_date`, is extracted to represent only the date portion of each record.  

This transformation enables accurate temporal grouping, filtering, and trend analysis.

In [4]:

# --- BEFORE ---
print("Before transformation:")
print(df_raw['tweet_created'].head(3))
print("\nColumn data type:", df_raw['tweet_created'].dtype)

# Apply transformation
df_raw['tweet_created'] = pd.to_datetime(df_raw['tweet_created'], errors='coerce')
df_incremental['tweet_created'] = pd.to_datetime(df_incremental['tweet_created'], errors='coerce')

# Extract date component
df_raw['tweet_date'] = df_raw['tweet_created'].dt.date
df_incremental['tweet_date'] = df_incremental['tweet_created'].dt.date

# --- AFTER ---
print("\nAfter transformation:")
print(df_raw[['tweet_created', 'tweet_date']].head(3))
print("\nColumn data type after:", df_raw['tweet_created'].dtype)

Before transformation:
0    2015-02-24 11:35:52-08:00
1    2015-02-24 11:15:59-08:00
2    2015-02-24 11:15:48-08:00
Name: tweet_created, dtype: object

Column data type: object

After transformation:
              tweet_created  tweet_date
0 2015-02-24 11:35:52-08:00  2015-02-24
1 2015-02-24 11:15:59-08:00  2015-02-24
2 2015-02-24 11:15:48-08:00  2015-02-24

Column data type after: datetime64[ns, UTC-08:00]


**Discussion:**  
The `tweet_created` column has been converted from strings to a timezone-aware datetime column (`datetime64[ns, UTC-08:00]`), and a new `tweet_date` field holds the calendar date of each tweet in that same UTC-08:00 offset.  
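
One detail worth keeping in mind is that the extracted date depends on the timezone the timestamps are held in. The sketch below uses two made-up timestamps with the same `-08:00` offset as the data above and compares the local calendar date with the UTC calendar date; it is an illustration, not a change to the pipeline.

```python
import pandas as pd

# Toy timestamps (made up) using the same -08:00 offset as the dataset
stamps = pd.Series([
    "2015-02-24 11:35:52-08:00",
    "2015-02-24 23:30:00-08:00",
])

# Default parsing keeps the original offset, so .dt.date is the local date
local = pd.to_datetime(stamps, errors="coerce")
local_dates = local.dt.date.tolist()

# utc=True converts to UTC first, so late-evening tweets roll over to the next day
utc = pd.to_datetime(stamps, errors="coerce", utc=True)
utc_dates = utc.dt.date.tolist()

print("Local dates:", local_dates)
print("UTC dates:  ", utc_dates)
```

Daily aggregates can therefore differ slightly depending on which convention is chosen, so it helps to pick one and apply it to both datasets.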


---
### Transformation 4: Enriching the Dataset with Tweet Length

To enrich the dataset with an additional analytical feature, I created a new column called `tweet_length`.  
This column captures the number of characters in each tweet (based on the `text` field).  

The metric provides insight into how expressive or concise user feedback tends to be,  
and it can later be used in exploratory visualizations or feature engineering.

In [8]:
# Transformation 4: Enriching the Dataset with Tweet Length

# --- BEFORE ---
print("Before transformation:")
print(df_raw[['text']].head(3))

# Apply transformation
df_raw['tweet_length'] = df_raw['text'].astype(str).apply(len)
df_incremental['tweet_length'] = df_incremental['text'].astype(str).apply(len)

# --- AFTER ---
print("\nAfter transformation:")
print(df_raw[['text', 'tweet_length']].head(3))

print("\nIncremental dataset sample:")
print(df_incremental[['text', 'tweet_length']].head(3))

Before transformation:
                                                text
0                @VirginAmerica What @dhepburn said.
1  @VirginAmerica plus you've added commercials t...
2  @VirginAmerica I didn't today... Must mean I n...

After transformation:
                                                text  tweet_length
0                @VirginAmerica What @dhepburn said.            35
1  @VirginAmerica plus you've added commercials t...            72
2  @VirginAmerica I didn't today... Must mean I n...            71

Incremental dataset sample:
                                                text  tweet_length
0  @USAirways  is there a better time to call? My...           128
1  @USAirways and when will one of these agents b...            67
2  @JetBlue Yesterday on my way from EWR to FLL j...           115


**Discussion:**  
Both datasets have been enriched with a new variable, `tweet_length`, representing the total number of characters per tweet.  
This enables, for example, future analysis of the relationship between tweet sentiment and tweet length in characters.  
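
Since `tweet_length` counts characters rather than words, the sketch below shows both measures side by side on toy data. The `word_count` column is a hypothetical addition for illustration only. Missing text is filled with an empty string first, because casting with `astype(str)` can turn a missing value into a literal string such as `"nan"`, depending on the pandas version and dtype.

```python
import numpy as np
import pandas as pd

# Toy data (made up); the first tweet is taken from the preview above
toy = pd.DataFrame({
    "text": ["@VirginAmerica What @dhepburn said.", "  Thanks!  ", np.nan],
})

# Treat missing text as empty so it counts as length 0
text = toy["text"].fillna("")

# Character count, including spaces and punctuation
toy["tweet_length"] = text.str.len()

# Word count (hypothetical extra feature): split on runs of whitespace
toy["word_count"] = text.str.split().str.len()

print(toy[["tweet_length", "word_count"]])
```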


---
### Transformation 5: Deriving Sentiment Category Flags

To allow for future numerical analysis or potential model training,  
I created three binary indicator columns representing the sentiment categories:

- `sentiment_positive` → 1 if sentiment is *positive*, else 0  
- `sentiment_neutral` → 1 if sentiment is *neutral*, else 0  
- `sentiment_negative` → 1 if sentiment is *negative*, else 0  

This transformation converts the categorical sentiment label into a structured numeric format for easier aggregation and analysis.

In [10]:
# --- BEFORE ---
print("Before transformation:")
print(df_raw[['airline_sentiment']].head(3))

# Apply transformation
for df in [df_raw, df_incremental]:
    df['sentiment_positive'] = (df['airline_sentiment'].str.lower() == 'positive').astype(int)
    df['sentiment_neutral'] = (df['airline_sentiment'].str.lower() == 'neutral').astype(int)
    df['sentiment_negative'] = (df['airline_sentiment'].str.lower() == 'negative').astype(int)

# --- AFTER ---
print("\nAfter transformation:")
print(df_raw[['airline_sentiment', 'sentiment_positive', 'sentiment_neutral', 'sentiment_negative']].head(3))

Before transformation:
  airline_sentiment
0           neutral
1          positive
2           neutral

After transformation:
  airline_sentiment  sentiment_positive  sentiment_neutral  sentiment_negative
0           neutral                   0                  1                   0
1          positive                   1                  0                   0
2           neutral                   0                  1                   0


**Discussion:**  
The categorical variable `airline_sentiment` has been converted into three binary indicator columns.  
Each column now explicitly identifies whether a record represents a positive, neutral, or negative tweet.  
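
An alternative way to build the same flags is `pd.get_dummies` on a categorical with a fixed list of labels. The sketch below is a possible variant rather than the method used above; the helper name `add_sentiment_flags` and the `SENTIMENT_LABELS` list are assumptions. Fixing the categories guarantees that all three columns exist even if a label is absent from a given dataset.

```python
import pandas as pd

# Assumed label set, matching the three sentiment values described above
SENTIMENT_LABELS = ["positive", "neutral", "negative"]


def add_sentiment_flags(df, column="airline_sentiment"):
    """Hypothetical helper: append sentiment_<label> 0/1 columns to a copy of df."""
    normalized = df[column].str.strip().str.lower()
    # A categorical with fixed categories yields a column for every label,
    # including labels that never occur in this particular frame
    categorical = pd.Categorical(normalized, categories=SENTIMENT_LABELS)
    flags = pd.get_dummies(categorical, prefix="sentiment", dtype=int)
    flags.index = df.index  # align with the source rows before joining
    return df.join(flags)


# Toy data (made up), with mixed capitalization and no negative tweets
toy = pd.DataFrame({"airline_sentiment": ["neutral", "Positive", "neutral"]})
flagged = add_sentiment_flags(toy)
print(flagged)
```

Here `sentiment_negative` is present but all zeros, which keeps the raw and incremental outputs on the same schema.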


---
### Transformation 6: Standardizing Airline Names

During extraction, the `airline` field contained valid names such as *Virgin America*, *United*, and *American*.  
However, inconsistencies in capitalization or spacing can cause grouping and aggregation errors.

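To standardize this column, surrounding whitespace is stripped with `str.strip()` and all airline names are converted to title case with `str.title()`.  
This gives consistent formatting across both datasets, with one side effect visible in the output: title-casing rewrites the acronym in `US Airways` to `Us Airways`.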

In [None]:
# --- BEFORE ---
print("Before transformation:")
print(df_raw['airline'].unique())

# Apply transformation
for df in [df_raw, df_incremental]:
    df['airline'] = df['airline'].str.strip().str.title()

# --- AFTER ---
print("\nAfter transformation:")
print(df_raw['airline'].unique())

Before transformation:
['Virgin America' 'United' 'Southwest' 'Delta' 'US Airways' 'American']

After transformation:
['Virgin America' 'United' 'Southwest' 'Delta' 'Us Airways' 'American']
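
If the original spelling `US Airways` should be preserved, one option is to normalize whitespace and then map names case-insensitively onto a list of canonical spellings instead of title-casing. The sketch below is an assumption-laden illustration: the helper `standardize_airline` is hypothetical, and the canonical list is copied from the "before" output above.

```python
import pandas as pd

# Canonical spellings, taken from the unique values printed before the transformation
CANONICAL_AIRLINES = [
    "Virgin America", "United", "Southwest", "Delta", "US Airways", "American",
]
_LOOKUP = {name.lower(): name for name in CANONICAL_AIRLINES}


def standardize_airline(names):
    """Hypothetical helper: trim and collapse whitespace, then restore canonical casing."""
    cleaned = names.str.strip().str.replace(r"\s+", " ", regex=True)
    # Names not found in the lookup keep their cleaned form
    return cleaned.str.lower().map(_LOOKUP).fillna(cleaned)


# Toy data (made up) with spacing and capitalization problems
toy = pd.Series(["  us airways ", "US Airways", "virgin  america", "Unknown Air"])
standardized = standardize_airline(toy)
print(standardized.tolist())

# For comparison, plain title-casing lowercases the acronym
print("US Airways".title())
```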


---
## 3. Saving Transformed Data

After applying all transformations, the final step involves saving the processed versions of both datasets. 


In [13]:
# 3. Saving Transformed Data

# Ensure transformed directory exists
os.makedirs("transformed", exist_ok=True)

# Define output paths
raw_transformed_path = os.path.join("transformed", "transformed_raw.csv")
incremental_transformed_path = os.path.join("transformed", "transformed_incremental.csv")

# Save transformed datasets
df_raw.to_csv(raw_transformed_path, index=False)
df_incremental.to_csv(incremental_transformed_path, index=False)

# Verify save
print("Transformed datasets saved to 'transformed/' directory.")
print("Files:")
print(f"- {raw_transformed_path}")
print(f"- {incremental_transformed_path}")

Transformed datasets saved to 'transformed/' directory.
Files:
- transformed/transformed_raw.csv
- transformed/transformed_incremental.csv
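
A quick round-trip check can confirm that a saved file reloads with the expected shape and columns. The sketch below uses a toy frame and a temporary directory, both assumptions for illustration. It also shows that CSV does not store dtypes: the datetime column comes back as plain text unless it is parsed again on load.

```python
import os
import tempfile

import pandas as pd

# Toy frame (values taken from the previews above) with a timezone-aware datetime column
toy = pd.DataFrame({
    "tweet_id": [570306133677760513, 570301130888122368],
    "tweet_created": pd.to_datetime(
        ["2015-02-24 11:35:52-08:00", "2015-02-24 11:15:59-08:00"]
    ),
    "tweet_length": [35, 72],
})

# Write to a temporary location (hypothetical file name) and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "transformed_toy.csv")
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
ids_preserved = reloaded["tweet_id"].tolist() == toy["tweet_id"].tolist()
dates_are_datetime = pd.api.types.is_datetime64_any_dtype(reloaded["tweet_created"])

print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
print("IDs preserved:", ids_preserved)
print("tweet_created still datetime:", dates_are_datetime)
```

Later stages that read the transformed files would need to convert `tweet_created` again, for example with `pd.to_datetime`.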
