<a href="https://colab.research.google.com/github/ShubhamxGupta/Rental-Customer-Feedback-Analyzer/blob/main/Rental_Customer_Feedback_Analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Analyze customer reviews from "train_data.csv" and "test_data.csv" using NLTK and transformers to perform sentiment analysis and feature extraction, and display the detailed output for every entry in a tabular format.

## Load the data

### Subtask:
Load the `train_data.csv` and `test_data.csv` files into pandas DataFrames.


**Reasoning**:
Import pandas and load the train and test data into dataframes.



In [1]:
import pandas as pd

train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')

print("Train DataFrame:")
display(train_df.head())

print("\nTest DataFrame:")
display(test_df.head())

Train DataFrame:


Unnamed: 0,ID,Gender,Status,Children,Age,Customer_Status,Car_Owner,Customer_Service,Satisfaction,Business_Area,Action
0,1000,Female,F,0,29.07,Active,Yes,"Vehicle was not clean upon delivery, disappoin...",1,Service: Orders/Contracts,Standard Pickup
1,1001,Male,F,0,29.08,Active,No,"All went quite smoothly... it was Enterprise, ...",0,Service: Accessibility,Visit Nearest Center
2,1002,Male,F,3,52.7,Inactive,No,"Long waiting time for pickup, very inconvenient.",0,Service: Attitude,Contact Support
3,1003,Female,M,2,22.79,Active,Yes,"Vehicle was not clean upon delivery, disappoin...",1,Service: Accessibility,On-demand pickup location
4,1004,Female,F,1,36.25,Active,No,"Very friendly service from the staff, quick an...",0,Service: Attitude,Contact Support



Test DataFrame:


Unnamed: 0,ID,Gender,Status,Children,Age,Customer_Status,Car_Owner,Customer_Service,Satisfaction,Business_Area,Action
0,1000,Male,M,2,48.88,Active,Yes,"Long waiting time for pickup, very inconvenient.",1,Product: Pricing and Billing,Visit Nearest Center
1,1001,Female,F,2,65.41,Active,Yes,"All went quite smoothly... it was Enterprise, ...",1,Service: Accessibility,On-demand pickup location
2,1002,Female,F,0,63.0,Inactive,Yes,"Very friendly service from the staff, quick an...",1,Service: Accessibility,
3,1003,Female,F,0,19.43,Inactive,Yes,Excellent response dealing with child seat.,1,Service: Attitude,Standard Pickup
4,1004,Female,F,1,21.16,Inactive,Yes,The agent subsequently offered me a car type o...,0,Product: Pricing and Billing,Free Upgrade


## Explore the data

### Subtask:
Display the first few rows, check for missing values, and understand the structure of the data.


**Reasoning**:
Show the head of each DataFrame, then use `info()` and `isnull().sum()` to inspect the column types and count missing values in both the train and test data.



In [2]:
print("Train DataFrame Head:")
display(train_df.head())

print("\nTest DataFrame Head:")
display(test_df.head())

print("\nTrain DataFrame Info:")
train_df.info()

print("\nTest DataFrame Info:")
test_df.info()

print("\nMissing values in Train DataFrame:")
print(train_df.isnull().sum())

print("\nMissing values in Test DataFrame:")
print(test_df.isnull().sum())

Train DataFrame Head:


Unnamed: 0,ID,Gender,Status,Children,Age,Customer_Status,Car_Owner,Customer_Service,Satisfaction,Business_Area,Action
0,1000,Female,F,0,29.07,Active,Yes,"Vehicle was not clean upon delivery, disappoin...",1,Service: Orders/Contracts,Standard Pickup
1,1001,Male,F,0,29.08,Active,No,"All went quite smoothly... it was Enterprise, ...",0,Service: Accessibility,Visit Nearest Center
2,1002,Male,F,3,52.7,Inactive,No,"Long waiting time for pickup, very inconvenient.",0,Service: Attitude,Contact Support
3,1003,Female,M,2,22.79,Active,Yes,"Vehicle was not clean upon delivery, disappoin...",1,Service: Accessibility,On-demand pickup location
4,1004,Female,F,1,36.25,Active,No,"Very friendly service from the staff, quick an...",0,Service: Attitude,Contact Support



Test DataFrame Head:


Unnamed: 0,ID,Gender,Status,Children,Age,Customer_Status,Car_Owner,Customer_Service,Satisfaction,Business_Area,Action
0,1000,Male,M,2,48.88,Active,Yes,"Long waiting time for pickup, very inconvenient.",1,Product: Pricing and Billing,Visit Nearest Center
1,1001,Female,F,2,65.41,Active,Yes,"All went quite smoothly... it was Enterprise, ...",1,Service: Accessibility,On-demand pickup location
2,1002,Female,F,0,63.0,Inactive,Yes,"Very friendly service from the staff, quick an...",1,Service: Accessibility,
3,1003,Female,F,0,19.43,Inactive,Yes,Excellent response dealing with child seat.,1,Service: Attitude,Standard Pickup
4,1004,Female,F,1,21.16,Inactive,Yes,The agent subsequently offered me a car type o...,0,Product: Pricing and Billing,Free Upgrade



Train DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                200 non-null    int64  
 1   Gender            200 non-null    object 
 2   Status            200 non-null    object 
 3   Children          200 non-null    int64  
 4   Age               200 non-null    float64
 5   Customer_Status   200 non-null    object 
 6   Car_Owner         200 non-null    object 
 7   Customer_Service  200 non-null    object 
 8   Satisfaction      200 non-null    int64  
 9   Business_Area     200 non-null    object 
 10  Action            158 non-null    object 
dtypes: float64(1), int64(3), object(7)
memory usage: 17.3+ KB

Test DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --

## Preprocess the text data

### Subtask:
Clean and preprocess the customer review text. This may include removing special characters, punctuation, and stop words, and converting text to lowercase.


**Reasoning**:
Define a function to preprocess text and apply it to the 'Customer_Service' column of both dataframes.



In [3]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    if isinstance(text, str):
        # Convert to lowercase
        text = text.lower()
        # Remove special characters and punctuation
        text = re.sub(r'[^a-zA-Z\s]', '', text)
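        # Remove stop words (NLTK's English list includes negations such as
        # "not" and "no", so those are dropped here as well)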
        text = ' '.join([word for word in text.split() if word not in stop_words])
        return text
    return ""

train_df['cleaned_review'] = train_df['Customer_Service'].apply(preprocess_text)
test_df['cleaned_review'] = test_df['Customer_Service'].apply(preprocess_text)

print("Train DataFrame with cleaned reviews:")
display(train_df[['Customer_Service', 'cleaned_review']].head())

print("\nTest DataFrame with cleaned reviews:")
display(test_df[['Customer_Service', 'cleaned_review']].head())

Train DataFrame with cleaned reviews:


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Customer_Service,cleaned_review
0,"Vehicle was not clean upon delivery, disappoin...",vehicle clean upon delivery disappointed
1,"All went quite smoothly... it was Enterprise, ...",went quite smoothly enterprise even picked get...
2,"Long waiting time for pickup, very inconvenient.",long waiting time pickup inconvenient
3,"Vehicle was not clean upon delivery, disappoin...",vehicle clean upon delivery disappointed
4,"Very friendly service from the staff, quick an...",friendly service staff quick easy process



Test DataFrame with cleaned reviews:


Unnamed: 0,Customer_Service,cleaned_review
0,"Long waiting time for pickup, very inconvenient.",long waiting time pickup inconvenient
1,"All went quite smoothly... it was Enterprise, ...",went quite smoothly enterprise even picked get...
2,"Very friendly service from the staff, quick an...",friendly service staff quick easy process
3,Excellent response dealing with child seat.,excellent response dealing child seat
4,The agent subsequently offered me a car type o...,agent subsequently offered car type upgrade co...
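
The cleaned output above shows that "Vehicle was not clean upon delivery" becomes "vehicle clean upon delivery", because the negation is removed along with the other stop words. One possible variation, sketched below, is to keep a small set of negation words during cleaning. The stop-word set and the `NEGATIONS` set here are hypothetical, hand-picked stand-ins so the snippet runs without NLTK; they are not the notebook's actual lists.

```python
import re

# Hypothetical, hand-picked subset of English stop words used only for this
# sketch; the notebook itself uses NLTK's full English list.
SAMPLE_STOP_WORDS = {"was", "the", "a", "not", "no", "with", "of", "very", "for"}

# Negation words we choose to keep so that polarity survives the cleaning step.
NEGATIONS = {"not", "no", "never", "nor"}


def preprocess_keep_negations(text, stop_words=SAMPLE_STOP_WORDS, keep=NEGATIONS):
    """Lowercase, strip non-letters, and drop stop words except negations."""
    if not isinstance(text, str):
        return ""
    # Lowercase first, then remove everything that is not a letter or whitespace
    text = re.sub(r"[^a-z\s]", "", text.lower())
    # Remove the negation words from the stop-word set before filtering
    effective_stop_words = set(stop_words) - set(keep)
    return " ".join(word for word in text.split() if word not in effective_stop_words)


print(preprocess_keep_negations("Vehicle was not clean upon delivery, disappointed."))
```

In the notebook, the same idea could be tried by subtracting a negation set from `stop_words` before calling `preprocess_text`; whether that improves the downstream sentiment labels would need to be checked on the data.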


## Perform sentiment analysis

### Subtask:
Use a pre-trained model or library (like NLTK or transformers) to determine the sentiment (positive, negative, neutral) of each review.


**Reasoning**:
Import necessary NLTK modules, download the lexicon, initialize the analyzer, define the sentiment function, and apply it to the cleaned reviews in both dataframes. Finally, display the relevant columns to verify the results.



In [4]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon if not already downloaded
try:
    SentimentIntensityAnalyzer()
except LookupError:
    nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Define a function to get sentiment label
def get_sentiment(text):
    if not isinstance(text, str) or not text.strip():
        return 'neutral' # Handle empty or non-string input gracefully
    scores = analyzer.polarity_scores(text)
    compound_score = scores['compound']
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply the function to create the 'sentiment' column
train_df['sentiment'] = train_df['cleaned_review'].apply(get_sentiment)
test_df['sentiment'] = test_df['cleaned_review'].apply(get_sentiment)

# Display the relevant columns to verify
print("Train DataFrame with sentiment:")
display(train_df[['Customer_Service', 'cleaned_review', 'sentiment']].head())

print("\nTest DataFrame with sentiment:")
display(test_df[['Customer_Service', 'cleaned_review', 'sentiment']].head())

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Train DataFrame with sentiment:


Unnamed: 0,Customer_Service,cleaned_review,sentiment
0,"Vehicle was not clean upon delivery, disappoin...",vehicle clean upon delivery disappointed,negative
1,"All went quite smoothly... it was Enterprise, ...",went quite smoothly enterprise even picked get...,neutral
2,"Long waiting time for pickup, very inconvenient.",long waiting time pickup inconvenient,negative
3,"Vehicle was not clean upon delivery, disappoin...",vehicle clean upon delivery disappointed,negative
4,"Very friendly service from the staff, quick an...",friendly service staff quick easy process,positive



Test DataFrame with sentiment:


Unnamed: 0,Customer_Service,cleaned_review,sentiment
0,"Long waiting time for pickup, very inconvenient.",long waiting time pickup inconvenient,negative
1,"All went quite smoothly... it was Enterprise, ...",went quite smoothly enterprise even picked get...,neutral
2,"Very friendly service from the staff, quick an...",friendly service staff quick easy process,positive
3,Excellent response dealing with child seat.,excellent response dealing child seat,positive
4,The agent subsequently offered me a car type o...,agent subsequently offered car type upgrade co...,neutral
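
The labeling rule inside `get_sentiment` can be isolated from VADER so that its boundary behavior is easy to check. The sketch below reproduces only the thresholding step with the same cut-offs of 0.05 and -0.05; the function name `label_from_compound` and the sample scores are made up for illustration and are not outputs of the analyzer.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    Scores at or above +threshold are positive, scores at or below
    -threshold are negative, and everything in between is neutral.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative scores only; these are not values produced by VADER.
for score in (0.6, 0.05, 0.0, -0.049, -0.05):
    print(score, label_from_compound(score))
```

Keeping the raw compound score in its own column next to the label would make it possible to see how close a review sits to either threshold.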


## Perform feature extraction

### Subtask:
Identify frequently mentioned issues or keywords in the reviews. This could involve techniques like TF-IDF or topic modeling.


**Reasoning**:
Combine the cleaned reviews, apply TF-IDF, and identify the top terms to find frequently mentioned issues or keywords.



In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Combine cleaned reviews from both dataframes
all_cleaned_reviews = pd.concat([train_df['cleaned_review'], test_df['cleaned_review']], ignore_index=True)

# Instantiate TfidfVectorizer
# max_features keeps the 1000 terms with the highest corpus frequency; ngram_range=(1, 2) includes unigrams and bigrams
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))

# Fit the vectorizer to the combined reviews and transform
tfidf_matrix = vectorizer.fit_transform(all_cleaned_reviews)

# Get feature names
feature_names = vectorizer.get_feature_names_out()

# Calculate the sum of TF-IDF scores for each feature
tfidf_sums = tfidf_matrix.sum(axis=0).A1

# Create a pandas Series of feature names and their sums
feature_tfidf_scores = pd.Series(tfidf_sums, index=feature_names)

# Sort features by their TF-IDF sums in descending order and get the top N
top_n = 20 # Define how many top terms to display
top_terms = feature_tfidf_scores.sort_values(ascending=False).head(top_n)

# Display the top terms
print(f"Top {top_n} most important terms (TF-IDF scores):")
display(top_terms)

Top 20 most important terms (TF-IDF scores):


Unnamed: 0,0
long,24.498263
time,18.107454
long lineup,16.1193
slow,16.1193
slow long,16.1193
lineup,16.1193
clean upon,14.333333
delivery,14.333333
clean,14.333333
disappointed,14.333333
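
Several terms in the output above share exactly the same summed score (for example "slow", "lineup", "slow long", and "long lineup"). A plausible explanation is that these terms always occur together in the same reviews, so they receive identical weights. The sketch below recomputes summed TF-IDF in plain Python on three toy documents, assuming the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and per-document L2 normalization that `TfidfVectorizer` applies by default. The helper names are hypothetical, and the sketch tokenizes on whitespace rather than with scikit-learn's token pattern.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams (joined by spaces) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def summed_tfidf(docs):
    """Sum L2-normalized TF-IDF weights per term across all documents."""
    doc_counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in df.items()}

    totals = Counter()
    for counts in doc_counts:
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue  # empty document contributes nothing
        for term, w in weights.items():
            totals[term] += w / norm
    return totals


# Toy corpus built from cleaned reviews that appear in this notebook
toy_docs = [
    "slow long lineup",
    "slow long lineup",
    "long waiting time pickup inconvenient",
]
totals = summed_tfidf(toy_docs)
for term in ("slow", "lineup", "long lineup", "long"):
    print(f"{term}: {totals[term]:.4f}")
```

On this toy corpus, "slow", "lineup", and "long lineup" end up tied because they occur in the same two documents, while "long" differs because it also appears in the third.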


## Generate summary reports

### Subtask:
Create summary reports that include overall sentiment distribution, most frequent issues, and potentially examples of reviews for each category.


**Reasoning**:
Calculate and display the sentiment distribution, print the top terms, and display example reviews for each sentiment category to create the summary reports.



In [6]:
import pandas as pd

# 1. Calculate and display sentiment distribution
print("Sentiment Distribution in Training Data:")
display(train_df['sentiment'].value_counts(normalize=True))

print("\nSentiment Distribution in Testing Data:")
display(test_df['sentiment'].value_counts(normalize=True))

# 2. Print the top_terms Series
print(f"\nTop {top_n} most important terms (TF-IDF scores):")
display(top_terms)

# Combine dataframes for sampling reviews
combined_df = pd.concat([train_df, test_df], ignore_index=True)

# 3. Sample and display example reviews for each sentiment category
print("\nExample Reviews by Sentiment:")

for sentiment_category in ['positive', 'negative', 'neutral']:
    print(f"\n--- {sentiment_category.capitalize()} Reviews ---")
    # Filter reviews by sentiment once, then sample up to five of them
    # (sample() is unseeded here, so the picks vary between runs)
    matching_reviews = combined_df.loc[combined_df['sentiment'] == sentiment_category, 'Customer_Service']
    example_reviews = matching_reviews.sample(min(5, len(matching_reviews)))
    for i, review in enumerate(example_reviews):
        print(f"{i+1}. {review}")

Sentiment Distribution in Training Data:


Unnamed: 0_level_0,proportion
sentiment,Unnamed: 1_level_1
neutral,0.36
positive,0.355
negative,0.285



Sentiment Distribution in Testing Data:


Unnamed: 0_level_0,proportion
sentiment,Unnamed: 1_level_1
positive,0.42
neutral,0.38
negative,0.2



Top 20 most important terms (TF-IDF scores):


Unnamed: 0,0
long,24.498263
time,18.107454
long lineup,16.1193
slow,16.1193
slow long,16.1193
lineup,16.1193
clean upon,14.333333
delivery,14.333333
clean,14.333333
disappointed,14.333333



Example Reviews by Sentiment:

--- Positive Reviews ---
1. Last time I rented a car was at Manchester, NH airport and they do not have office there anymore.
2. I had a few recent rentals that have taken a very very long time, with no offer of apology.
3. I had a few recent rentals that have taken a very very long time, with no offer of apology.
4. Last time I rented a car was at Manchester, NH airport and they do not have office there anymore.
5. Last time I rented a car was at Manchester, NH airport and they do not have office there anymore.

--- Negative Reviews ---
1. Vehicle was not clean upon delivery, disappointed.
2. Long waiting time for pickup, very inconvenient.
3. Vehicle was not clean upon delivery, disappointed.
4. Vehicle was not clean upon delivery, disappointed.
5. Long waiting time for pickup, very inconvenient.

--- Neutral Reviews ---
1. The agent subsequently offered me a car type on an upgrade coupon and then told me it was no longer available.
2. Please lower the

## Display detailed output

### Subtask:
Present a detailed output of each review with its extracted sentiment and identified features in a tabular format.


**Reasoning**:
Combine the train and test dataframes, select the relevant columns, add columns for top terms presence, and display the resulting dataframe.



In [7]:
import pandas as pd

# Combine the train_df and test_df DataFrames
combined_df = pd.concat([train_df, test_df], ignore_index=True)

# Select the relevant columns to display
display_df = combined_df[['Customer_Service', 'cleaned_review', 'sentiment']].copy()

# Add columns for the top terms, indicating presence
for term in top_terms.index:
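    # Token-level membership test on the cleaned review. A multi-word term
    # such as "long lineup" can never equal a single token, so the columns
    # for bigram terms are always False with this check.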
    display_df[f'has_{term}'] = display_df['cleaned_review'].apply(lambda x: term in x.split())

# Display the resulting DataFrame
print("Detailed output of each review with sentiment and top term presence:")
display(display_df)

Detailed output of each review with sentiment and top term presence:


Unnamed: 0,Customer_Service,cleaned_review,sentiment,has_long,has_time,has_long lineup,has_slow,has_slow long,has_lineup,has_clean upon,...,has_vehicle clean,has_upon,has_vehicle,has_upon delivery,has_delivery disappointed,has_lower,has_lower prices,has_prices,has_please,has_please lower
0,"Vehicle was not clean upon delivery, disappoin...",vehicle clean upon delivery disappointed,negative,False,False,False,False,False,False,False,...,False,True,True,False,False,False,False,False,False,False
1,"All went quite smoothly... it was Enterprise, ...",went quite smoothly enterprise even picked get...,neutral,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,"Long waiting time for pickup, very inconvenient.",long waiting time pickup inconvenient,negative,True,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,"Vehicle was not clean upon delivery, disappoin...",vehicle clean upon delivery disappointed,negative,False,False,False,False,False,False,False,...,False,True,True,False,False,False,False,False,False,False
4,"Very friendly service from the staff, quick an...",friendly service staff quick easy process,positive,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,I had a few recent rentals that have taken a v...,recent rentals taken long time offer apology,positive,True,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
296,I had a few recent rentals that have taken a v...,recent rentals taken long time offer apology,positive,True,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
297,"Slow, long lineup",slow long lineup,neutral,True,False,False,True,False,True,False,...,False,False,False,False,False,False,False,False,False,False
298,"Last time I rented a car was at Manchester, NH...",last time rented car manchester nh airport off...,positive,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
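
In the table above, row 297 ("slow long lineup") is False for `has_long lineup` and `has_slow long`, because a bigram is compared against single tokens. One way to make the check phrase-aware is to compare contiguous token windows, as in the sketch below. The function name `contains_term` is hypothetical and not part of the notebook.

```python
def contains_term(cleaned_review, term):
    """Return True if `term` (one or more words) occurs as a contiguous
    token sequence in `cleaned_review`."""
    tokens = cleaned_review.split()
    term_tokens = term.split()
    n = len(term_tokens)
    if n == 0:
        return False
    # Slide a window of n tokens over the review and compare to the term
    return any(tokens[i:i + n] == term_tokens for i in range(len(tokens) - n + 1))


print(contains_term("slow long lineup", "long lineup"))
print(contains_term("long waiting time pickup inconvenient", "long time"))
```

In the notebook this could replace the lambda, for instance `display_df['cleaned_review'].apply(lambda x: contains_term(x, term))`. Matching whole tokens rather than substrings avoids counting "clean" inside a longer word such as "cleaning".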


## Summary:

### Data Analysis Key Findings

*   The training dataset contains 200 entries and the testing dataset contains 100 entries, both with 11 columns.
*   Missing values were found only in the `Action` column of both DataFrames (42 in `train_df` and 16 in `test_df`).
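*   Sentiment analysis using VADER classified reviews into positive, negative, and neutral categories. The training data showed a sentiment distribution of 36% neutral, 35.5% positive, and 28.5% negative. The testing data had a higher proportion of positive reviews (42%), followed by neutral (38%) and negative (20%). These labels should be read with caution: VADER was run on the cleaned text, where stop-word removal had already stripped negations such as "not" and "no". For example, "I had a few recent rentals that have taken a very very long time, with no offer of apology." was labeled positive, and "Slow, long lineup" was labeled neutral.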
*   TF-IDF analysis identified the top 20 most important terms/n-grams, including "long lineup", "slow long", "clean upon", "delivery disappointed", and "lower prices," indicating common themes around waiting times, cleanliness, and pricing.

### Insights or Next Steps

*   Investigate the missing values in the 'Action' column to understand their impact and determine an appropriate handling strategy (e.g., imputation or exclusion) if this column is required for further analysis or modeling.
*   Explore the relationship between the identified top terms and the sentiment of the reviews to gain deeper insights into which specific issues drive positive or negative customer experiences.
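
As a starting point for the second next step, the sketch below counts how often each term co-occurs with each sentiment label. It uses a handful of (cleaned review, sentiment) pairs copied from the detailed output in this notebook, and it handles single-word terms only. The function name `term_sentiment_counts` and the choice of terms are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# (cleaned_review, sentiment) pairs taken from the detailed output table
labeled_reviews = [
    ("vehicle clean upon delivery disappointed", "negative"),
    ("long waiting time pickup inconvenient", "negative"),
    ("friendly service staff quick easy process", "positive"),
    ("recent rentals taken long time offer apology", "positive"),
    ("slow long lineup", "neutral"),
]

# Hypothetical selection of single-word terms to inspect
terms_of_interest = ["long", "time", "clean"]


def term_sentiment_counts(rows, terms):
    """Count, for each term, how many reviews of each sentiment contain it."""
    counts = defaultdict(Counter)
    for cleaned_review, sentiment in rows:
        tokens = set(cleaned_review.split())
        for term in terms:
            if term in tokens:
                counts[term][sentiment] += 1
    return counts


counts = term_sentiment_counts(labeled_reviews, terms_of_interest)
for term in terms_of_interest:
    print(term, dict(counts[term]))
```

With the full DataFrame, `pd.crosstab` on a term-presence column and the `sentiment` column would give the same kind of table.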
