# Data cleaning process

In [10]:

import pandas as pd
reviews = pd.read_csv("Airlines_User-Reviews_Raw.csv")

# Display general information about the dataset
# (info() prints its summary directly and returns None, so it is not wrapped in print)
print("Dataset Info:")
reviews.info()

# Count rows that are exact duplicates of an earlier row
duplicates_count = reviews.duplicated().sum()
print("\nDuplicate Rows Count:")
print(duplicates_count)


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3290 entries, 0 to 3289
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Aircraft Type           1896 non-null   object 
 1   Users Reviews           3290 non-null   object 
 2   Country                 3289 non-null   object 
 3   Type_of_Travellers      2887 non-null   object 
 4   Route                   2883 non-null   object 
 5   Seat_Types              3287 non-null   object 
 6   Seat Comfort            3176 non-null   float64
 7   Date Flown              2880 non-null   object 
 8   Cabin Staff Service     3165 non-null   float64
 9   Ground Service          2812 non-null   float64
 10  Food & Beverages        2911 non-null   float64
 11  Wifi & Connectivity     592 non-null    float64
 12  Inflight Entertainment  2171 non-null   float64
 13  Value For Money         3290 non-null   int64  
 14  Recommended             32

## Exploration

### Duplicates 

##### Identifying and Handling Duplicates:

While cleaning the dataset, it's crucial to consider which entities (here, the dataset's columns) can legitimately contain duplicates and which duplicates should be removed for meaningful analysis. Let's go through the main entities and discuss how to handle duplicates in each:

#### Aircraft Type:

Aircraft type information is crucial for understanding the context of user reviews. Duplicates in this category might indicate repeated issues or experiences with specific aircraft models. It's advisable to retain duplicates for this entity to capture patterns associated with different aircraft types.
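
As a minimal sketch of why repeated aircraft types are useful, the snippet below counts reviews per aircraft type and averages a rating for each. The `sample` frame is invented toy data standing in for `reviews`; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
sample = pd.DataFrame({
    "Aircraft Type": ["A320", "Boeing 777", "A320", None, "A320"],
    "Value For Money": [2, 4, 1, 3, 2],
})

# Repeated aircraft types are what make a per-type summary possible.
# Rows with a missing aircraft type are left out by groupby's default behavior.
per_aircraft = (
    sample.groupby("Aircraft Type")["Value For Money"]
    .agg(review_count="count", mean_rating="mean")
    .sort_values("review_count", ascending=False)
)
print(per_aircraft)
```

Dropping rows just because the aircraft type repeats would leave one review per type and make such a summary meaningless.
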

#### Users Reviews:

User reviews are the core of sentiment analysis. Similar reviews can arise when different users share similar experiences, and such repeats do not diminish their value: retaining them gives a more comprehensive sentiment analysis, because it captures sentiments that several users expressed independently.
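
One way to see how many reviews repeat is to compare the raw text with a lightly normalized version. This is only a sketch on invented toy data; lowercasing and collapsing whitespace is an assumed normalization, not something the dataset prescribes.

```python
import pandas as pd

# Toy review texts, invented for illustration.
sample = pd.DataFrame({
    "Users Reviews": [
        "Great flight, friendly crew.",
        "great flight,  friendly crew. ",
        "Delayed by three hours.",
        "Great flight, friendly crew.",
    ]
})

# Normalize case and whitespace so trivially different copies compare equal.
normalized = (
    sample["Users Reviews"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# keep=False marks every member of a duplicate group, not just the later copies.
exact_dupes = sample["Users Reviews"].duplicated(keep=False)
near_dupes = normalized.duplicated(keep=False)

print("Rows with identical text:", exact_dupes.sum())
print("Rows identical after normalization:", near_dupes.sum())
```
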

#### Country, Type_of_Travellers, Route, Seat_Types, etc.:

These entities provide additional context to user reviews. While exact duplicates may not offer unique insights, subtle variations could be significant. For instance, the same route flown by different types of travelers might yield diverse experiences. It's essential to carefully assess duplicates in these categories.
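
The point about the same route being flown by different traveller types can be checked with a small pivot. The sketch below uses invented toy rows; the column names match the dataset, the values do not come from it.

```python
import pandas as pd

# Toy rows, invented for illustration.
sample = pd.DataFrame({
    "Route": ["London to Paris", "London to Paris", "London to Paris", "Doha to Rome"],
    "Type_of_Travellers": ["Business", "Family Leisure", "Business", "Solo Leisure"],
    "Seat Comfort": [4.0, 2.0, 5.0, 3.0],
})

# Average seat comfort for each (route, traveller type) pair,
# with traveller types spread across the columns.
by_route_and_traveller = (
    sample.groupby(["Route", "Type_of_Travellers"])["Seat Comfort"]
    .mean()
    .unstack()
)
print(by_route_and_traveller)
```

Combinations that never occur in the data show up as `NaN` in the resulting table.
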

#### Date Flown:

Duplicates based on the date flown might be indicative of recurring issues during specific time periods. Retaining duplicates for this entity enables the identification of trends or recurring problems associated with particular dates.
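
A rough sketch of counting reviews per flight month follows. It assumes "Date Flown" holds strings such as "March 2023"; that format is an assumption, so adjust the `format` argument to whatever the real column contains.

```python
import pandas as pd

# Toy values; the "Month YYYY" format is assumed, not confirmed by the dataset.
sample = pd.DataFrame({
    "Date Flown": ["March 2023", "March 2023", "April 2023", None, "March 2023"],
})

# Unparseable or missing dates become NaT instead of raising an error.
flown = pd.to_datetime(sample["Date Flown"], format="%B %Y", errors="coerce")

# Many reviews sharing one month is exactly the repetition worth keeping.
reviews_per_month = flown.dt.to_period("M").value_counts().sort_index()
counts = {str(period): int(n) for period, n in reviews_per_month.items()}
print(counts)
```
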

##### Approach to Handling Duplicates:

#### Identify Unique Identifiers:

Before deciding on duplicate handling, identify unique identifiers that distinguish between similar entities. For example, if two reviews seem similar, check if there are differences in other attributes like date, country, or aircraft type.
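
The check described above can be sketched as follows: group the rows that share the same review text and count how many distinct values each group has in the other attributes. The data is an invented toy frame.

```python
import pandas as pd

# Toy rows, invented for illustration.
sample = pd.DataFrame({
    "Users Reviews": ["Good", "Good", "Bad", "Good"],
    "Country": ["UK", "UK", "US", "France"],
    "Date Flown": ["May 2023", "May 2023", "May 2023", "June 2023"],
})

# Rows whose review text appears more than once.
same_text = sample[sample.duplicated(subset=["Users Reviews"], keep=False)]

# For each repeated text, how many distinct countries and dates accompany it?
# A count above 1 means the "duplicates" differ in context.
distinct_context = same_text.groupby("Users Reviews")[["Country", "Date Flown"]].nunique()
print(distinct_context)

# Rows that match another row in every column are true duplicates.
full_row_dupes = sample.duplicated(keep=False)
print("Rows duplicated in every column:", full_row_dupes.sum())
```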

#### Retain for Contextual Analysis:

Consider retaining repeated values in entities like "Aircraft Type," "Users Reviews," and other relevant categories to preserve context for deeper analysis.

#### Consolidate or Remove Exact Duplicates:

For rows that are exact duplicates of one another (e.g., the same entry in "Users Reviews" submitted twice with identical values in every other column), the extra copies provide limited value, and you may choose to consolidate or remove them. This ensures that each distinct review contributes only once to the analysis.
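
Consolidating, as opposed to simply dropping, can keep a record of how often each row occurred. The sketch below is one possible approach on invented toy data; the `occurrences` column name is hypothetical.

```python
import pandas as pd

# Toy rows, invented for illustration; the first two are exact duplicates.
sample = pd.DataFrame({
    "Users Reviews": ["Good flight", "Good flight", "Bad flight"],
    "Country": ["UK", "UK", "US"],
})

# Group on every column so only fully identical rows are merged.
# dropna=False keeps rows with missing values as their own groups.
consolidated = (
    sample.groupby(list(sample.columns), dropna=False)
    .size()
    .reset_index(name="occurrences")  # hypothetical column name
)
print(consolidated)
```

Compared with `drop_duplicates()`, this keeps one row per distinct review while preserving the number of copies, which can be useful for auditing the cleaning step later.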

#### Conclusion:

Balancing the retention and removal of duplicates is a nuanced process. The decision depends on the specific objectives of your analysis and the nature of the dataset. By carefully evaluating duplicates, you can enhance the quality of insights derived from the dataset and provide a more accurate understanding of user experiences in the airline industry.

In [1]:
import pandas as pd

# Load the raw data here as well so the cell runs in a fresh kernel,
# where `reviews` is not yet defined
reviews = pd.read_csv("Airlines_User-Reviews_Raw.csv")

# Retain duplicates for contextual analysis: keep a copy of every row that
# shares its aircraft type and flight date with at least one other row
contextual_duplicates_cols = ["Aircraft Type", "Date Flown"]
contextual_duplicates = reviews.duplicated(subset=contextual_duplicates_cols, keep=False)
contextual_duplicates_df = reviews[contextual_duplicates]

# Remove rows that are exact duplicates across all columns
reviews = reviews.drop_duplicates()

# Handle variations in categorical data
reviews["Country"] = reviews["Country"].str.lower()  # Example: Standardize country names to lowercase

# Address null values
# For demonstration purposes, fill missing text values with a placeholder;
# numeric rating columns keep NaN so they stay numeric
text_cols = reviews.select_dtypes(exclude="number").columns
reviews[text_cols] = reviews[text_cols].fillna("Not Available")

# Document cleaning steps in a log
cleaning_log = """
Cleaning Steps Log:
1. Selected columns for contextual duplicate checks: 'Aircraft Type', 'Date Flown'.
2. Retained duplicates for contextual analysis in 'contextual_duplicates_df'.
3. Removed exact duplicate rows from 'reviews'.
4. Handled variations in categorical data (e.g., standardized 'Country' to lowercase).
5. Addressed null values by filling text columns with a placeholder.
"""

# Save cleaned dataset to a new CSV file
reviews.to_csv("Airlines_User-Reviews_Cleaned.csv", index=False)

# Display cleaning log
print(cleaning_log)
print("Cleaning completed. Cleaned dataset saved to 'Airlines_User-Reviews_Cleaned.csv'.")