### Exploratory Data Analysis (EDA)

* Project Overview

The project aims to create a self-learning AI Chat-Bot for Moringa School using the Natural Language Toolkit (NLTK). This Chat-Bot is designed to enhance user interaction on the Moringa School website by providing information, answering inquiries, and improving engagement through natural language processing.
* Objectives of the EDA

The primary objective of the EDA is to thoroughly examine the dataset comprising intents, questions, and responses. The analysis aims to understand the distribution and nature of the data, identify patterns, and uncover insights that will guide the development of the AI Chat-Bot. Key areas include understanding the diversity of intents, analyzing the structure and length of questions and responses, and identifying common themes and word frequencies.
* Data Visualization

Data visualization plays a crucial role in this EDA. Various graphical representations will be employed to illustrate the findings. These include bar charts to show intent distribution, histograms for analyzing question and response lengths, and word clouds for visualizing frequent terms. These visual tools will help in making data-driven decisions and in conveying complex data in an accessible format for stakeholders.

In [None]:
# Install third-party dependencies (notebook shell command)
!pip install wordcloud

# Standard library
import json
import re
import statistics
from collections import Counter

# Third-party libraries
import matplotlib.pyplot as plt
import nltk
import numpy as np
from nltk import FreqDist, bigrams, trigrams
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, word_tokenize
from scipy import stats

# Download the NLTK resources used in this notebook
nltk.download('vader_lexicon')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [None]:
# loading the data
with open('../Final_Intents.json', 'r') as file:
    intents = json.load(file)

In [None]:
# Function to process and analyze the data for display
def process_intents(intents):
    for intent in intents:
        print(f"Intent: {intent.get('tag', 'No tag available')}")
        if 'questions' in intent:
            print("Questions:")
            for question in intent['questions']:
                print(f" - {question}")
        else:
            print("Questions: No questions available")

        if 'responses' in intent:
            print("Responses:")
            for response in intent['responses']:
                print(f" - {response}")
        else:
            print("Responses: No responses available")
        print("\n")

# Call the function with your intents list
process_intents(intents)

#### Text Preprocessing

In [None]:
# Normalization: Convert text to lowercase
normalized_questions = [question.lower() for intent in intents for question in intent['questions']]
# Display the first few normalized questions
print("Normalized Questions (Sample):", normalized_questions[:5])

In [None]:
# Removing punctuation and special characters
cleaned_questions = [re.sub(r'[^\w\s]', '', question) for question in normalized_questions]
print("Cleaned Questions (Sample):", cleaned_questions[:5])

In [None]:
# Tokenization: Break text into words
tokenized_questions = [question.split() for question in cleaned_questions]
# Display the first few tokenized questions
print("Tokenized Questions (Sample):", tokenized_questions[:5])

In [None]:
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [[word for word in question if word not in stop_words] for question in tokenized_questions]
# Display the first few sets of filtered words
print("Filtered Words (Sample):", filtered_words[:5])


In [None]:
# Lemmatization: reduce each word to its dictionary form
lemmatizer = WordNetLemmatizer()
lemmatized_words = [[lemmatizer.lemmatize(word) for word in question] for question in filtered_words]

print("Lemmatized Words (Sample):", lemmatized_words[:5])

In [None]:
# Stemming
stemmer = PorterStemmer()
stemmed_words = [[stemmer.stem(word) for word in question] for question in filtered_words]
# Display the first few sets of stemmed words
print("Stemmed Words (Sample):", stemmed_words[:5])
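
The preprocessing steps above (lowercasing, punctuation removal, tokenization, stop-word removal) can be collected into a single function so that the same pipeline is applied consistently. The following is a minimal sketch rather than the notebook's actual code: `preprocess` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list so that the example runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses stopwords.words('english') from NLTK.
DEMO_STOP_WORDS = {"what", "is", "the", "a", "of", "at", "do", "you"}


def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()                   # normalization
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation and symbols
    tokens = text.split()                 # whitespace tokenization
    return [tok for tok in tokens if tok not in stop_words]


sample = "What is the duration of the Data Science course?"
print(preprocess(sample))
```

To use the full NLTK list instead, `set(stopwords.words('english'))` can be passed as the `stop_words` argument.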

## EDA
#### Intent distribution
We analyze the frequency of the different intents. This helps us understand which topics are more common and which might require more training data or refined responses.


In [None]:
# Statistics for the intents
intent_counts = {intent['tag']: len(intent['questions']) for intent in intents}

# Calculate additional statistics
mean_questions_per_intent = sum(intent_counts.values()) / len(intent_counts)
# statistics.median also handles an even number of intents correctly
median_questions_per_intent = statistics.median(intent_counts.values())
total_intents = len(intent_counts)

# Find the intent(s) with the fewest questions
min_questions_count = min(intent_counts.values())
intents_with_fewest_questions = [intent for intent, count in intent_counts.items() if count == min_questions_count]

# Print the calculated statistics
print(f"Total Intents: {total_intents}")
print(f"Mean Questions per Intent: {mean_questions_per_intent:.2f}")
print(f"Median Questions per Intent: {median_questions_per_intent}")
print(f"Intent(s) with Fewest Questions ({min_questions_count} questions): {', '.join(intents_with_fewest_questions)}")

* Findings:
* There are 75 unique intents in the dataset. Each intent represents a specific category or topic.
* On average, each intent contains approximately 8.69 questions.
* The median number of questions per intent is 10, so at least half of the intents have 10 or fewer questions and at least half have 10 or more. Since the mean (8.69) is below the median, the distribution appears to be pulled down by a group of intents with relatively few questions.
* Several intents share the minimum of 5 questions. These intents may represent topics with less coverage than the others and are candidates for additional training data.
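
A median taken as the middle element of a sorted list is exact only when the number of values is odd (as with the 75 intents here). The sketch below illustrates the difference using made-up counts, not the dataset's values; `middle_element` is a hypothetical helper.

```python
import statistics

# Made-up question counts for illustration; not taken from the dataset.
odd_counts = [5, 10, 12]
even_counts = [5, 8, 10, 12]


def middle_element(values):
    """Middle element of the sorted values (upper middle when the length is even)."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# With an odd number of values both approaches agree.
print(middle_element(odd_counts), statistics.median(odd_counts))

# With an even number of values, statistics.median averages the two
# middle values, while the index-based shortcut returns the upper one.
print(middle_element(even_counts), statistics.median(even_counts))
```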

##### Visualizing the First 10 and Last 10 Intents by Number of Questions

In [None]:
sorted_intents = sorted(intent_counts.items(), key=lambda item: item[1], reverse=True)

# Extract top 10 and last 10 intents
top_10_intents = sorted_intents[:10]
last_10_intents = sorted_intents[-10:]
combined_intents = top_10_intents + last_10_intents

# Unpack the combined intents for plotting
tags, counts = zip(*combined_intents)

# Bar Chart for Top 10 and Last 10 Intent Counts - Rotated
plt.figure(figsize=(8, 6))
plt.barh(tags, counts, color=['skyblue']*10 + ['orange']*10)
plt.ylabel('Intents')
plt.xlabel('Number of Questions')
plt.title('Top 10 and Last 10 Intents by Number of Questions')
plt.tight_layout()
plt.show()

###  Question and Response Length Analysis

In [None]:
# Calculate the word counts of all questions and responses
# (split() with no arguments splits on any run of whitespace)
question_lengths = [len(question.split()) for intent in intents for question in intent['questions']]
response_lengths = [len(response.split()) for intent in intents for response in intent.get('responses', [])]
# Calculate the mean, median, and standard deviation for question lengths
question_mean = statistics.mean(question_lengths)
question_median = statistics.median(question_lengths)
question_std_dev = statistics.stdev(question_lengths)

# Calculate the mean, median, and standard deviation for response lengths
response_mean = statistics.mean(response_lengths)
response_median = statistics.median(response_lengths)
response_std_dev = statistics.stdev(response_lengths)

# Print the statistics
print("Statistics for Question Lengths:")
print(f"Mean: {question_mean}")
print(f"Median: {question_median}")
print(f"Standard Deviation: {question_std_dev}")
print("\n")
print("Statistics for Response Lengths:")
print(f"Mean: {response_mean}")
print(f"Median: {response_median}")
print(f"Standard Deviation: {response_std_dev}")

Questions: The statistics for question lengths show that, on average, questions are relatively short (around 10 words), with a moderate level of variability.
Responses: In contrast, responses are longer on average (around 21 words), with greater variability in their lengths.
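
Because questions and responses have different mean lengths, their standard deviations are not directly comparable. One common way to compare spread across such groups is the coefficient of variation (standard deviation divided by the mean). A minimal sketch with made-up word counts follows; `coefficient_of_variation` is a hypothetical helper and not part of the notebook.

```python
import statistics


def coefficient_of_variation(lengths):
    """Sample standard deviation divided by the mean (a unitless measure of spread)."""
    return statistics.stdev(lengths) / statistics.mean(lengths)


# Made-up word counts for illustration; not the dataset's values.
demo_question_lengths = [8, 10, 12]
demo_response_lengths = [10, 20, 30]

print(f"Questions CV: {coefficient_of_variation(demo_question_lengths):.2f}")
print(f"Responses CV: {coefficient_of_variation(demo_response_lengths):.2f}")
```

In the notebook, the same helper could be applied to `question_lengths` and `response_lengths`.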

In [None]:
# Overlaid histograms of question and response lengths
# (reuses question_lengths and response_lengths computed in the previous cell)
plt.hist(question_lengths, bins=20, alpha=0.5, label='Questions')
plt.hist(response_lengths, bins=20, alpha=0.5, label='Responses')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.legend()
plt.title('Question and Response Lengths')
plt.show()

In [None]:
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.violinplot(data=[question_lengths, response_lengths], inner="quartile")
plt.xticks([0, 1], ['Questions', 'Responses'])
plt.ylabel('Word Count')
plt.title('Distribution of Word Counts in Questions and Responses')
plt.show()

### Word Frequency Analysis
Identify the most common words.

In [None]:
# Calculating word frequency
all_words = [word for question in stemmed_words for word in question]  # Flatten the list
word_freq = Counter(all_words)

# Display the most common words
print("Most Common Words:", word_freq.most_common(20))

* Findings:
* The analysis provides insights into the key themes and topics present in the dataset.
* The words shown are Porter stems, so 'cours' stands for 'course' and 'courses', and 'scienc' for 'science'.
* The most common words reflect a strong emphasis on courses, data science, Moringa, and educational aspects.
* With 'cours' being the most frequent stem, the dataset places a substantial emphasis on educational offerings.
* 'data', 'scienc', and 'moringa' indicate a strong focus on data science education, aligning with industry and institutional themes.
* The repetition of terms like 'student', 'develop', and 'learn' underscores a learner-centric approach in the dataset.
* Understanding these frequently occurring words is valuable for gaining insights into the main subjects and focus areas of the dataset.
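
Stemmed tokens such as 'cours' and 'scienc' are hard to read in reports. One option is to remember, for each stem, the most frequent original word that produced it. The sketch below assumes this approach: `readable_stems` is a hypothetical helper, and `toy_stem` is a stand-in for `PorterStemmer().stem` so that the example runs without NLTK.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Stand-in stemmer for illustration: drops one trailing 's', then one trailing 'e'."""
    for suffix in ("s", "e"):
        if len(word) > 3 and word.endswith(suffix):
            word = word[:-1]
    return word


def readable_stems(tokens, stem):
    """Map each stem to the most frequent original token that produced it."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


# Made-up tokens for illustration; not the dataset's values.
tokens = ["course", "courses", "course", "science", "data"]
labels = readable_stems(tokens, toy_stem)
stem_freq = Counter(toy_stem(t) for t in tokens)

# Report frequencies per stem, but label each with a readable word
for stem_form, freq in stem_freq.most_common():
    print(labels[stem_form], freq)
```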

#### Visualizing Most Common Words

In [None]:
# Selecting the top 20 most common words
most_common_words = word_freq.most_common(20)
# Unpacking words and frequencies
words, frequencies = zip(*most_common_words)

# Creating subplots with two columns
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 8))

# Bar plot for the top 20 most common words
axes[0].bar(words, frequencies, color='skyblue')
axes[0].set_xlabel('Words')
axes[0].set_ylabel('Frequency')
axes[0].tick_params(axis='x', rotation=90)
axes[0].set_title('Top 20 Most Common Words in Questions')

# Bubble chart for the top 20 most common words (bubble size encodes frequency)
bubble_colors = plt.cm.viridis(np.linspace(0, 1, len(words)))  # Spread colors across the full colormap
axes[1].scatter(words, [1]*len(words), s=[freq*5 for freq in frequencies], alpha=0.7, color=bubble_colors)
axes[1].set_xlabel('Words')
axes[1].set_yticks([])  # The y position carries no information
axes[1].set_title('Word Bubble Visualization')
plt.setp(axes[1].get_xticklabels(), rotation=45, ha='right')

# Adjust layout
plt.tight_layout()
plt.show()

### Word Cloud for Most Common Words

In [None]:
# Creating a word cloud for visualizing the most frequent words.

from wordcloud import WordCloud

# Generate a word cloud from the stemmed question tokens
all_words_string = ' '.join(word for sublist in stemmed_words for word in sublist)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_words_string)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

### Bi-grams and Tri-grams Analysis
Analyze the most common bi-grams and tri-grams.

In [None]:
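# Create bi-grams and tri-grams from the flattened list of stemmed tokens.
# Note: all_words concatenates every question, so a few n-grams span the
# boundary between two consecutive questions.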
bi_grams = list(bigrams(all_words))
tri_grams = list(trigrams(all_words))

# Frequency distribution
bi_gram_freq = FreqDist(bi_grams)
tri_gram_freq = FreqDist(tri_grams)

# Display most common bi-grams and tri-grams
print("Most Common Bi-grams:", bi_gram_freq.most_common(10))
print("Most Common Tri-grams:", tri_gram_freq.most_common(10))

* Findings:
* The most common bi-gram is 'data science' with a frequency of 136, indicating a strong association between these two words. Because the n-grams are built from stemmed tokens, it is printed as ('data', 'scienc').
* 'Moringa school' is also highly frequent, suggesting a focus on education or programs related to Moringa School.
* In tri-grams, 'data science course' is the most common, reinforcing the emphasis on data science education.
* 'Moringa school' appears in both bi-grams and tri-grams, indicating its relevance and prominence in the dataset.
* Other terms like 'mobile development,' 'software engineering,' and 'financial aid' also stand out in the most common n-grams.
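
N-grams built from one flattened token list can pair the last word of one question with the first word of the next. The sketch below avoids this by generating n-grams inside each question separately. `ngrams_per_question` is a hypothetical helper and the token lists are made up for illustration.

```python
from collections import Counter


def ngrams_per_question(token_lists, n):
    """Count n-grams without crossing question boundaries."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted views of the same list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up token lists for illustration; not the dataset's values.
demo_questions = [
    ["data", "science", "course"],
    ["moringa", "school", "data", "science"],
]

bi = ngrams_per_question(demo_questions, 2)
tri = ngrams_per_question(demo_questions, 3)
print(bi.most_common(3))
print(tri.most_common(3))
```

In the notebook, this could be called as `ngrams_per_question(stemmed_words, 2)`, since `stemmed_words` is already a list of token lists. The resulting counts may differ slightly from those obtained on the flattened list.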





#### Visualizing the N-grams

In [None]:
# Extracting labels and frequencies for bi-grams
bi_gram_labels, bi_gram_counts = zip(*bi_gram_freq.most_common(10))

# Extracting labels and frequencies for tri-grams
tri_gram_labels, tri_gram_counts = zip(*tri_gram_freq.most_common(10))

# Plotting the bar plot with two columns
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

# Bar plot for bi-grams
ax1.bar(range(len(bi_gram_labels)), bi_gram_counts, color='skyblue')
ax1.set_xlabel('Bi-grams')
ax1.set_ylabel('Frequency')
ax1.set_xticks(range(len(bi_gram_labels)))
# Join each n-gram tuple into a readable string for the tick label
ax1.set_xticklabels([' '.join(gram) for gram in bi_gram_labels], rotation=45, ha='right')
ax1.set_title('Most Common Bi-grams')

# Bar plot for tri-grams
ax2.bar(range(len(tri_gram_labels)), tri_gram_counts, color='salmon')
ax2.set_xlabel('Tri-grams')
ax2.set_ylabel('Frequency')
ax2.set_xticks(range(len(tri_gram_labels)))
# Join each n-gram tuple into a readable string for the tick label
ax2.set_xticklabels([' '.join(gram) for gram in tri_gram_labels], rotation=45, ha='right')
ax2.set_title('Most Common Tri-grams')

# Adjust layout
plt.tight_layout()
plt.show()
