## BUSINESS UNDERSTANDING

### Overview
Kenya Power and Lighting Company (KPLC) often receives a high volume of tweets from customers reporting issues, asking questions, or providing feedback. Understanding customer sentiment towards KPLC is crucial for automating responses, enhancing customer service efficiency, improving response times, and reducing the manual workload on customer service teams. The goal is to develop a chatbot capable of classifying various types of tweets and generating appropriate automated responses.


## Problem Statement
KPLC needs an automated sentiment analysis system to process and categorize customer feedback from social media, particularly X (formerly Twitter), where customers frequently express their sentiments regarding KPLC's services. By accurately classifying tweets about KPLC's services into sentiment categories, the system will be able to pinpoint common complaints and service issues and make customer feedback easier to act on.

### Objectives

* To gauge overall customer sentiment towards KPLC's services.

* To identify specific issues mentioned in the tweets, such as token problems, power outages and billing issues.

* To create a chatbot that provides appropriate responses to customer inquiries.


### Challenges
1. Data Collection and Preprocessing:
Gathering relevant tweets mentioning KPLC, especially when customers use various hashtags, misspellings or slang, can be difficult. Additionally, cleaning and preprocessing the data (e.g., removing noise like unrelated tweets, abbreviations) is crucial but time-consuming.

2. Sentiment Analysis Accuracy:
Accurately classifying the sentiment of tweets can be challenging due to the informal language, sarcasm, mixed sentiments and local dialects often used on X/Twitter.

3. Identifying Specific Issues:
Extracting and categorizing specific issues (e.g., power outages, billing issues) mentioned in tweets can be complex due to the diverse ways in which customers describe their problems.

4. Real-time Data Processing:
Processing a continuous stream of tweets in real-time to provide timely insights and responses is demanding in terms of computational resources and model efficiency.

5. Handling Multilingual and Local Dialects:
Tweets may be in multiple languages or include local dialects, which can complicate sentiment analysis and issue detection.

6. Evaluating Model Performance:
Ensuring the models perform well across different contexts, languages, and over time requires ongoing evaluation and tuning.
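
The preprocessing described in challenge 1 can be sketched with the standard library alone. This is a minimal illustration, not the project's actual cleaning script: the function name `normalize_tweet`, the regular expressions and the example tweet are all assumptions made for this sketch. Note that it keeps only ASCII letters and digits, which may be too aggressive for some multilingual text.

```python
import re

# Hypothetical patterns chosen for this sketch; tune them against real data.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_TEXT_RE = re.compile(r"[^a-z0-9\s']")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @handles
    text = HASHTAG_RE.sub(r"\1", text)   # keep the hashtag word, drop '#'
    text = NON_TEXT_RE.sub(" ", text)    # drop remaining punctuation and symbols
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up example tweet, not taken from the dataset.
example = "@KenyaPower no power in Kitale since morning!! #KPLC https://x.com/example"
print(normalize_tweet(example))
```

Keeping the hashtag word while dropping the `#` is a design choice: hashtags on customer complaints often carry the topic, so discarding them entirely could lose signal.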




### Proposed Solution

* Use advanced Natural Language Processing (NLP) techniques and APIs (e.g., Twitter API) to collect and preprocess tweets.

* Implement data cleaning scripts to filter out irrelevant data and normalize the text for consistent analysis. 

* Train sentiment analysis models using machine learning techniques such as supervised learning with labeled datasets.

* Implement a robust pipeline using tools for real-time data streaming and processing. Integrate with scalable cloud services such as AWS or Google Cloud to ensure the system can handle large volumes of data efficiently.

* Utilize existing chatbot frameworks like Rasa, integrated with the sentiment analysis and issue categorization models. This chatbot should be able to provide relevant responses based on the sentiment and identified issues and direct users to appropriate resources or support channels.

* Incorporate multilingual NLP models and fine-tune them with local dialect data. Use translation APIs where necessary to standardize inputs before analysis.

* Set up a continuous evaluation framework using A/B testing, cross-validation and performance metrics such as accuracy, F1-score and precision/recall. Regularly retrain models with new data to adapt to evolving customer language and sentiment.
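
Before a trained model or a framework such as Rasa is in place, the issue categorization and response routing described above can be prototyped with a simple keyword baseline. The sketch below is only that: a rule-based stand-in, not the proposed chatbot. The keyword lists, the response texts and the function names `detect_issues` and `reply` are hypothetical.

```python
import re

# Hypothetical keyword lists; a real system would learn these from labeled tweets.
ISSUE_KEYWORDS = {
    "token": ["token", "tokens", "prepaid", "prepay"],
    "outage": ["outage", "blackout", "no power", "hakuna stima"],
    "billing": ["bill", "billing", "overcharged"],
}

# Hypothetical canned responses, for illustration only.
RESPONSES = {
    "token": "Sorry about the token issue. Please share your meter number.",
    "outage": "Sorry about the outage. Please share your location and meter number.",
    "billing": "Sorry about the billing issue. Please share your account number.",
    "fallback": "Thank you for reaching out. An agent will follow up shortly.",
}


def detect_issues(text: str) -> list[str]:
    """Return every issue category whose keywords appear as whole words."""
    lowered = text.lower()
    found = []
    for issue, keywords in ISSUE_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            found.append(issue)
    return found


def reply(text: str) -> str:
    """Pick the canned response for the first detected issue, if any."""
    issues = detect_issues(text)
    return RESPONSES[issues[0]] if issues else RESPONSES["fallback"]


print(detect_issues("@KenyaPower what is prepay bill number"))
print(reply("Power outage in our area since yesterday"))
```

A baseline like this is also useful later as a reference point: a trained classifier should at least beat the keyword rules on the same labeled tweets.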



### Metrics of success:

* Sentiment Accuracy: Percentage of correctly classified sentiments (positive, negative, neutral).

* Issue Detection Rate: Number of key issues identified and addressed based on sentiment analysis.
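
As a rough illustration of how sentiment accuracy and the precision, recall and F1 scores mentioned in the proposed solution are computed, the sketch below works them out by hand for three sentiment classes. The label lists are invented for the example, and the helper names are assumptions; in practice a library such as scikit-learn provides equivalent functions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_scores(y_true, y_pred, l)[2] for l in labels) / len(labels)


# Invented labels for illustration only.
labels = ["positive", "negative", "neutral"]
y_true = ["negative", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "negative"]

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred, labels):.3f}")
```

Macro-averaged F1 is worth reporting alongside accuracy because complaint data tends to be dominated by negative tweets, and accuracy alone can hide poor performance on the rarer classes.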


### Conclusion
The analysis of the tweets shows that, for Kenya Power and Lighting Company (KPLC), sentiment analysis can go a long way in helping the company understand and deal with customer feedback. KPLC will be able to focus on identifying the main problems, developing and implementing strategies for service improvement, and ultimately increasing customer satisfaction. The company will also be able to maintain its brand image and identify impending issues before they occur.

Despite difficulties such as handling large volumes of data and identifying relevant concerns on social media, sentiment analysis of tweets is effective. Since KPLC has established key performance indicators for some of its goals, such as a rise in customer satisfaction scores and a positive trend in brand sentiment, the company can use this tool to sustain its leadership in the energy sector while strengthening its relations with customers.


## DATA CLEANING

In [30]:
# Importing all the necessary Modules
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import numpy as np

Merging all CSV files into one CSV file.

In [31]:
# Specify the directory containing the CSV files
csv_directory = r'C:\Users\USER\Desktop\Data\PHASE5\KPLC'

# Specify the output file name
output_file = 'Data.csv'

# Create an empty list to hold DataFrames
csv_list = []

# Loop through all files in the directory
for filename in os.listdir(csv_directory):
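    # Skip the merged output itself so re-running the cell does not merge Data.csv back in
    if filename.endswith('.csv') and filename != output_file: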
        file_path = os.path.join(csv_directory, filename)
        try:
            # Try reading the CSV file with UTF-8 encoding
            df = pd.read_csv(file_path, encoding='utf-8')
        except UnicodeDecodeError:
            # If UTF-8 fails, try with ISO-8859-1 encoding
            df = pd.read_csv(file_path, encoding='ISO-8859-1')
        
        # Append the DataFrame to the list
        csv_list.append(df)

# Concatenate all DataFrames in the list
merged_df = pd.concat(csv_list, ignore_index=True)

# Save the concatenated DataFrame to a new CSV file
merged_df.to_csv(output_file, index=False)

print(f'Merged {len(csv_list)} files into {output_file}')

Merged 4 files into Data.csv


Now we can look at the basic structure of our data.

In [32]:
#Data_Loader class loads data
class Data_Loader:
    def __init__(self, data=None):
        self.data = data

    def load_data(self, file_path):
        # If data is not already loaded, load it
        if self.data is None:
            self.data = pd.read_csv(file_path)
        return self.data

# Create a loader instance, then load data from a file
loader = Data_Loader()
df = loader.load_data(r'C:\Users\USER\Desktop\Data\PHASE5\KPLC\Data.csv')
    


Let us look at the info, shape and data types that we have.

In [33]:
class Data_Loader:
    def load_data(self, file_path):
        return pd.read_csv(file_path)

class Data_Informer(Data_Loader):
    def __init__(self):
        super().__init__()

    def print_info(self, df):
        # Shape of the dataframe
        print("\nShape of the dataset:")
        print(df.shape)
        
        # Column data Information
        print("\nInformation about the Dataset:")
        print(df.info())
        
        # Data Types
        data_types = df.dtypes
        print("\nColumns and their data types:")
        for column, dtype in data_types.items():
            print(f"{column}: {dtype}")

# Create an instance of Data_Informer
data_informer = Data_Informer()

# Load the dataset using the instance
data = data_informer.load_data(r"C:\Users\USER\Desktop\Data\PHASE5\KPLC\Data.csv")

# Call the print_info method to print information about the dataset
data_informer.print_info(data)

# Display the first few rows of the dataset
print(data.head())



Shape of the dataset:
(45190, 9)

Information about the Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45190 entries, 0 to 45189
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Author        45105 non-null  object 
 1   Handle        45190 non-null  object 
 2   Post          45185 non-null  object 
 3   Date          45190 non-null  object 
 4   Likes         7930 non-null   object 
 5   Reposts       1005 non-null   object 
 6   Comments      4595 non-null   float64
 7   Post Link     8115 non-null   object 
 8   Profile Lİnk  8115 non-null   object 
dtypes: float64(1), object(8)
memory usage: 3.1+ MB
None

Columns and their data types:
Author: object
Handle: object
Post: object
Date: object
Likes: object
Reposts: object
Comments: float64
Post Link: object
Profile Lİnk: object
              Author           Handle  \
0            𝒟𝓎𝓃𝒶𝓈𝓉𝓎   @dynastyycolee   
1  Abhishek 𝐁𝐚𝐜𝐡𝐜𝐡𝐚𝐧  @juniorbachchan   

Our data has 9 features and 45,190 entries. Eight columns are of type object and one (Comments) is float64.
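
Since `Date` and `Likes` are stored as `object`, they would need converting before any numeric or time-based analysis. The sketch below shows one way to do that on a small stand-in frame. The date strings follow the `May 31, 2016` pattern visible in the data, but the `Likes` values are invented, and real values may need extra cleaning before `pd.to_numeric` can parse them.

```python
import pandas as pd

# Small stand-in frame; the Likes values are invented for illustration.
sample = pd.DataFrame({
    "Date": ["May 31, 2016", "Apr 15, 2016"],
    "Likes": ["3", None],
    "Comments": [1.0, None],
})

# Parse dates written like "May 31, 2016"; unparseable values become NaT.
sample["Date"] = pd.to_datetime(sample["Date"], format="%b %d, %Y", errors="coerce")

# Convert count columns stored as text; unparseable values become NaN.
sample["Likes"] = pd.to_numeric(sample["Likes"], errors="coerce")

print(sample.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on a single malformed value, at the cost of silently turning such values into missing data, so it is worth counting the missing values before and after.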


Now we can proceed to data cleaning by removing duplicate rows and outliers, and filling in missing values using the mean, median or mode.

In [35]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45190 entries, 0 to 45189
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Author        45105 non-null  object 
 1   Handle        45190 non-null  object 
 2   Post          45185 non-null  object 
 3   Date          45190 non-null  object 
 4   Likes         7930 non-null   object 
 5   Reposts       1005 non-null   object 
 6   Comments      4595 non-null   float64
 7   Post Link     8115 non-null   object 
 8   Profile Lİnk  8115 non-null   object 
dtypes: float64(1), object(8)
memory usage: 3.1+ MB


In [36]:
class DataCleaner:
    def __init__(self, data):
        self.data = data
    
    def remove_outliers(self):
        numeric_columns = self.data.select_dtypes(include=['number']).columns
        for column in numeric_columns:
            Q1 = self.data[column].quantile(0.25)
            Q3 = self.data[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            self.data = self.data[~((self.data[column] < lower_bound) | (self.data[column] > upper_bound))]
        return self.data
    
    def fill_missing_values(self, strategy='mean'):
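        if strategy == 'mean':
            # numeric_only=True restricts the mean to numeric columns; object columns are left unfilled
            self.data = self.data.fillna(self.data.mean(numeric_only=True))
        elif strategy == 'median':
            self.data = self.data.fillna(self.data.median(numeric_only=True))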
        elif strategy == 'mode':
            self.data = self.data.fillna(self.data.mode().iloc[0])
        return self.data
    
    def remove_duplicates(self):
        # Check for duplicates
        duplicate_rows = self.data[self.data.duplicated()]
        print(f"Number of duplicate rows: {len(duplicate_rows)}")
        
        # Remove duplicates
        self.data = self.data.drop_duplicates()
        print("Duplicates removed.")
        return self.data

# Assuming `data` is your DataFrame already loaded using Data_Informer
data = pd.read_csv(r"C:\Users\USER\Desktop\Data\PHASE5\KPLC\Data.csv")  # Or loaded using Data_Informer

# Create an instance of DataCleaner with the loaded data
data_cleaner = DataCleaner(data)

# Remove duplicates
data_cleaner.remove_duplicates()

# Remove outliers from all numeric columns
data_cleaner.remove_outliers()

# Fill missing values using the specified strategy (default is 'mean')
data_cleaner.fill_missing_values(strategy='mean')

# Now, `data_cleaner.data` holds the cleaned data
print(data_cleaner.data.head())


Number of duplicate rows: 37175
Duplicates removed.
                  Author        Handle  \
77  gilbert kaunda stima    @SuufStima   
78    Sack of cool vibes   @79patrickm   
79                 Njeru  @NjeruSamuel   
80                 Njeru  @NjeruSamuel   
81   JUZTUZ K Wa ARSENAL       @juztuz   

                                                 Post          Date Likes  \
77             @KenyaPower what is prepay bill number  May 31, 2016   NaN   
78  @kenyapower @kenyapower_care power outage Gwa-...  Apr 15, 2016   NaN   
79  @KenyaPower @KenyaPower_Care  HAKUNA STIMA KIT...  May 29, 2016   NaN   
80  rudisha Stima Kitale bwana, on! Off the whole ...  Jun 24, 2016   NaN   
81  @KenyaPower you people mean that machakos yote...  Jul 11, 2016   NaN   

   Reposts  Comments                  Post Link               Profile Lİnk  
77     NaN       1.0    https://x.com/SuufStima    https://x.com/SuufStima  
78     NaN       1.0   https://x.com/79patrickm   https://x.com/79patrickm  
7

The dataset had a large number of duplicate rows (37,175 of 45,190), which were removed. For the remaining rows, missing values were filled rather than dropped to avoid losing too much of the data.
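
A mean or median can only fill numeric columns, so the text columns still contain `NaN` after the step above, as the `Likes` and `Reposts` values in the output show. One possible way to handle both kinds of column is sketched below. The function name `fill_by_dtype` and the `"unknown"` placeholder are assumptions for this example, and it presumes that count-like columns have already been converted to numeric types.

```python
import pandas as pd


def fill_by_dtype(df: pd.DataFrame, placeholder: str = "unknown") -> pd.DataFrame:
    """Fill numeric columns with their median and all other columns with a placeholder."""
    out = df.copy()  # leave the caller's frame untouched
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.select_dtypes(exclude="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[other_cols] = out[other_cols].fillna(placeholder)
    return out


# Invented stand-in data for illustration.
sample = pd.DataFrame({
    "Author": ["a", None, "c"],
    "Comments": [1.0, None, 3.0],
})

print(sample.isna().mean())   # share of missing values per column
filled = fill_by_dtype(sample)
print(filled)
```

Checking the share of missing values per column first helps decide whether filling is sensible at all: a column that is mostly empty may be better dropped or treated separately than imputed.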