
# **Text Analysis Report**
This report documents a text analysis project I carried out in Python in a Jupyter Notebook. The main tasks were importing the text, building a search engine and a recommendation system, generating text summaries with several techniques, and evaluating the results.

**Proposed Text**

I imported the following text from my proposal:

In [2]:
# Sample proposal text
text = """
Education is essential for personal and societal growth. It enables individuals to develop skills, gain knowledge, and make informed decisions.
The purpose of education goes beyond academic learning; it also fosters critical thinking, creativity, and emotional intelligence.
In recent years, technology has transformed educational practices, introducing tools like online learning platforms, virtual classrooms, and interactive simulations.
Teachers now play an even more dynamic role as facilitators of learning, guiding students to be independent thinkers and lifelong learners.
Moreover, education helps in building a more equitable society by providing opportunities for social mobility and reducing poverty.
By understanding the importance of education, societies can invest in future generations to create a better world for everyone.
"""


I split the text into sentences and stored them in a Pandas DataFrame.

In [3]:
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize

# Tokenize the text into sentences
nltk.download('punkt')
sentences = sent_tokenize(text)
df = pd.DataFrame(sentences, columns=['sentence'])
df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


   sentence
0  \nEducation is essential for personal and soci...
1  It enables individuals to develop skills, gain...
2  The purpose of education goes beyond academic ...
3  In recent years, technology has transformed ed...
4  Teachers now play an even more dynamic role as...


**Create A Search Engine**

I created a simple search engine using each sentence as my "documents." I implemented a search function to find specific pieces of text.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize and fit the vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['sentence'])

def search(query, tfidf_matrix, vectorizer, sentences):
    query_vec = vectorizer.transform([query])
    results = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_indices = results.argsort()[-3:][::-1]  # Top 3 results
    return sentences.iloc[top_indices]

# Example search query
search_results = search("importance of education", tfidf_matrix, vectorizer, df['sentence'])
print("Search Results for 'importance of education':")
print(search_results)


Search Results for 'importance of education':
6    By understanding the importance of education, ...
2    The purpose of education goes beyond academic ...
0    \nEducation is essential for personal and soci...
Name: sentence, dtype: object


I examined the search results and found that the search engine retrieved relevant sentences for the query. The top hit contains the exact phrase "importance of education," and the next two results discuss the purpose of education and its role in personal and societal growth.
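To make the ranking step less of a black box, here is a minimal sketch of TF-IDF search written with only the standard library. It is not the code used above: the helper names (`tokenize`, `build_index`, `vectorize`) and the three-sentence corpus are my own illustrative assumptions. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1 with L2-normalised vectors, which should be close to the `TfidfVectorizer` defaults, though I have not checked that the numbers match exactly.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def vectorize(tokens, idf):
    # Term frequency times IDF, then L2-normalise; unknown terms are dropped
    weights = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def build_index(docs):
    """Return (idf, unit-length TF-IDF vectors) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]

def search(query, docs, top_k=3):
    idf, doc_vecs = build_index(docs)
    q = vectorize(tokenize(query), idf)
    # The dot product of two unit vectors is their cosine similarity
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in doc_vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k]]

# Hypothetical mini-corpus, shortened from the proposal text
docs = [
    "Education is essential for personal and societal growth.",
    "Technology has transformed educational practices.",
    "By understanding the importance of education, societies can invest in future generations.",
]

results = search("importance of education", docs)
for i, score in results:
    print(f"{score:.3f}  {docs[i]}")
```

The sentence that contains all three query terms ranks first, the one that shares only "education" ranks second, and the sentence with no shared terms scores zero. Note that "educational" does not match "education" because no stemming is applied.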

**Create A Recommendation System**

Using the same sentences, I developed a recommendation system to find similar sentences based on their themes.

In [5]:
def recommend(sentence, tfidf_matrix, vectorizer, sentences):
    query_vec = vectorizer.transform([sentence])
    results = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_indices = results.argsort()[-3:][::-1]
    return sentences.iloc[top_indices]

# Example recommendation based on a theme
recommendation_results = recommend("education and social growth", tfidf_matrix, vectorizer, df['sentence'])
print("\nRecommendation Results for 'education and social growth':")
print(recommendation_results)



Recommendation Results for 'education and social growth':
0    \nEducation is essential for personal and soci...
5    Moreover, education helps in building a more e...
2    The purpose of education goes beyond academic ...
Name: sentence, dtype: object


The recommendation system successfully identified sentences that discussed education's role in societal growth and poverty reduction. However, I noted that a more diverse set of texts discussing various educational perspectives might improve the recommendations.
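The `recommend` function above takes a free-text theme, so it behaves much like the search function. A common alternative is item-to-item recommendation: start from a sentence already in the collection and return the others most similar to it, excluding the sentence itself. The sketch below illustrates that idea under my own assumptions: it uses plain term-count cosine similarity instead of TF-IDF to stay dependency-free, and the function names and four example sentences are hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    # Bag of lowercase word tokens
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two term-count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend_similar(index, sentences, top_k=2):
    """Return indices of the top_k sentences most similar to sentences[index]."""
    vectors = [term_counts(s) for s in sentences]
    # Skip the query sentence so it never recommends itself
    scored = [(cosine(vectors[index], v), i) for i, v in enumerate(vectors) if i != index]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

# Hypothetical example sentences
sentences = [
    "Education reduces poverty and supports social mobility.",
    "Education supports social mobility in society.",
    "Online learning platforms use technology.",
    "Technology powers virtual classrooms and online learning.",
]

print(recommend_similar(0, sentences))           # neighbours of the first sentence
print(recommend_similar(2, sentences, top_k=1))  # closest match to the third
```

Excluding the query item matters here: without that filter, every sentence would recommend itself first with a similarity of 1.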

**Create Text Summaries**

I created a human summary of the text to capture the main ideas.

**Human Summary:** Education fosters critical thinking and creativity, and it plays a key role in reducing poverty and building a better society.

Next, I implemented LSA, TextRank, and Topic Modeling to generate automatic summaries.

1. LSA Summary

In [6]:
from sklearn.decomposition import TruncatedSVD

# Apply LSA: project each sentence onto the single strongest latent component
svd = TruncatedSVD(n_components=1, random_state=42)
lsa_scores = svd.fit_transform(tfidf_matrix)

# Combine with original sentences and get top N sentences.
# The sign of an SVD component is arbitrary, so rank by absolute value.
lsa_df = pd.DataFrame(lsa_scores, columns=['lsa_score']).abs()
lsa_sentences = pd.concat([df, lsa_df], axis=1)
top_n = 2  # Number of sentences in the summary
top_lsa_sentences = lsa_sentences.nlargest(top_n, 'lsa_score')['sentence'].values
print("LSA Summary:")
print(top_lsa_sentences)


LSA Summary:
['By understanding the importance of education, societies can invest in future generations to create a better world for everyone.'
 'Moreover, education helps in building a more equitable society by providing opportunities for social mobility and reducing poverty.']
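For intuition about what the single LSA component measures, the following sketch computes the same kind of score without scikit-learn. It is an illustration under my own assumptions, not the method used above: it works on raw term counts instead of TF-IDF, and it finds the top component by power iteration on the Gram matrix A·Aᵀ, whose dominant eigenvector is the first left singular vector of the sentence-by-term matrix A. The function name and example sentences are hypothetical, and the sentences are assumed to be non-empty.

```python
import math
import re
from collections import Counter

def lsa_scores(sentences, iterations=100):
    """Score each sentence by its weight on the top latent component."""
    counts = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Gram matrix G = A A^T, where A holds term counts per sentence
    gram = [[sum(counts[i][t] * counts[j][t] for t in counts[i]) for j in range(n)]
            for i in range(n)]
    # Power iteration converges to the dominant eigenvector of G
    u = [1.0] * n
    for _ in range(iterations):
        v = [sum(gram[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    # The eigenvalue of G is the squared singular value of A
    eigenvalue = sum(u[i] * sum(gram[i][j] * u[j] for j in range(n)) for i in range(n))
    sigma = math.sqrt(eigenvalue)
    # Absolute value guards against the arbitrary sign of the component
    return [abs(x) * sigma for x in u]

# Hypothetical example sentences
sentences = [
    "Education builds society.",
    "Education reduces poverty in society.",
    "Virtual classrooms.",
]

scores = lsa_scores(sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sentences that share vocabulary with many others load heavily on the top component, while a sentence with no shared terms (the third one here) ends up with a score near zero. This is why a one-component LSA summary tends to pick the most "central" sentences.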


2. TextRank Summary

In [7]:
import networkx as nx

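def textrank_summary(sentences, top_n=2):
    sentences = list(sentences)
    # Build the similarity matrix from the sentences passed in,
    # rather than relying on the global tfidf_matrix
    similarity_matrix = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    # Each sentence is a node; edge weights are cosine similarities
    nx_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(nx_graph)
    # Rank sentence positions by PageRank score, highest first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]  # Top sentences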

top_textrank_sentences = textrank_summary(df['sentence'])
print("\nTextRank Summary:")
print(top_textrank_sentences)



TextRank Summary:
['By understanding the importance of education, societies can invest in future generations to create a better world for everyone.', 'Moreover, education helps in building a more equitable society by providing opportunities for social mobility and reducing poverty.']
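TextRank selected the same two sentences as LSA. To show what the PageRank step does with the similarity graph, here is a dependency-free sketch. It reflects my own assumptions and is not the networkx implementation: similarities are term-count cosines, self-similarity is set to zero so a sentence cannot vote for itself, and the damping factor of 0.85 is the conventional default. The function names and four example sentences are hypothetical.

```python
import math
import re
from collections import Counter

def similarity_matrix(sentences):
    vecs = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    norms = [math.sqrt(sum(c * c for c in v.values())) for v in vecs]
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and norms[i] and norms[j]:  # leave the diagonal at zero
                dot = sum(vecs[i][t] * vecs[j][t] for t in vecs[i])
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim

def pagerank(weights, damping=0.85, iterations=100):
    n = len(weights)
    out_strength = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Score held by nodes without edges is spread evenly over all nodes
        dangling = sum(scores[j] for j in range(n) if out_strength[j] == 0)
        new_scores = []
        for i in range(n):
            inflow = sum(scores[j] * weights[j][i] / out_strength[j]
                         for j in range(n) if out_strength[j] > 0)
            new_scores.append((1 - damping) / n + damping * (inflow + dangling / n))
        scores = new_scores
    return scores

# Hypothetical example sentences
sentences = [
    "Education builds a better society.",
    "Education reduces poverty in society.",
    "Technology supports education.",
    "Poverty limits access to technology.",
]

scores = pagerank(similarity_matrix(sentences))
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(ranking[:2])
```

Each sentence passes its score to its neighbours in proportion to edge weight, so the sentence with the strongest overall connections (the second one here) rises to the top.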


In [13]:
!pip install gensim nltk pyLDAvis



3. Topic Modeling Summary

In [14]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Tokenize the text into sentences
sentences = nltk.sent_tokenize(text)
df = pd.DataFrame(sentences, columns=['sentence'])

# Define stop words
stop_words = set(stopwords.words('english'))

# Tokenize and clean the sentences
processed_docs = []
for sentence in df['sentence']:
    tokens = nltk.word_tokenize(sentence.lower())  # Lowercase and tokenize
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]  # Remove stop words and punctuation
    processed_docs.append(filtered_tokens)  # Append cleaned tokens to the list

# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train the LDA model
num_topics = 2  # Adjust based on your needs
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

# Display the topics
print("\nLDA Topics:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Visualize the topics
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



LDA Topics:
Topic 0: 0.043*"education" + 0.025*"society" + 0.025*"equitable" + 0.025*"providing" + 0.025*"mobility" + 0.025*"social" + 0.025*"building" + 0.025*"opportunities" + 0.025*"reducing" + 0.025*"moreover"
Topic 1: 0.038*"learning" + 0.026*"education" + 0.016*"online" + 0.016*"recent" + 0.016*"like" + 0.016*"years" + 0.016*"technology" + 0.016*"classrooms" + 0.016*"educational" + 0.016*"platforms"
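The LDA step prints topics but does not itself produce a summary. One possible way to turn topics into summary sentences is to pick, for each topic, the sentence that carries the most topic-word weight. The sketch below is a simple heuristic of my own, not a gensim feature: it reuses the word weights printed in the LDA output above and four sentences from the proposal text, and the function name `representative_sentence` is hypothetical.

```python
import re

# Word weights copied from the printed LDA topics
topics = {
    0: {"education": 0.043, "society": 0.025, "equitable": 0.025, "providing": 0.025,
        "mobility": 0.025, "social": 0.025, "building": 0.025, "opportunities": 0.025,
        "reducing": 0.025, "moreover": 0.025},
    1: {"learning": 0.038, "education": 0.026, "online": 0.016, "recent": 0.016,
        "like": 0.016, "years": 0.016, "technology": 0.016, "classrooms": 0.016,
        "educational": 0.016, "platforms": 0.016},
}

# Four sentences from the proposal text
sentences = [
    "The purpose of education goes beyond academic learning; it also fosters critical thinking, creativity, and emotional intelligence.",
    "In recent years, technology has transformed educational practices, introducing tools like online learning platforms, virtual classrooms, and interactive simulations.",
    "Moreover, education helps in building a more equitable society by providing opportunities for social mobility and reducing poverty.",
    "By understanding the importance of education, societies can invest in future generations to create a better world for everyone.",
]

def representative_sentence(topic_words, sentences):
    """Return the sentence whose words carry the most weight for this topic."""
    def score(sentence):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        return sum(weight for word, weight in topic_words.items() if word in tokens)
    return max(sentences, key=score)

summary = [representative_sentence(words, sentences) for words in topics.values()]
for topic_id, sentence in zip(topics, summary):
    print(f"Topic {topic_id}: {sentence}")
```

With these weights, Topic 0 maps to the sentence about an equitable society and Topic 1 to the sentence about technology in education. Because LDA is randomly initialised, the topics and their weights may differ between runs unless a seed is fixed.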


In [16]:
!pip install rouge-score



**Evaluate the Summaries**

To assess the generated summaries, I compared them to the human summary. I used ROUGE-1, ROUGE-2, and ROUGE-L scores to quantify the overlap.

In [17]:
from rouge_score import rouge_scorer

# Initialize ROUGE
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Assuming I have a human summary for comparison
human_summary = "Education fosters critical thinking and creativity, and it plays a key role in reducing poverty and building a better society."
lsa_summary_text = " ".join(top_lsa_sentences)
textrank_summary_text = " ".join(top_textrank_sentences)

# Get ROUGE scores. RougeScorer.score takes (target, prediction),
# so the human summary is the reference and each generated summary is the prediction.
scores_lsa = rouge.score(human_summary, lsa_summary_text)
scores_textrank = rouge.score(human_summary, textrank_summary_text)

print("\nROUGE Scores for LSA Summary:", scores_lsa)
print("ROUGE Scores for TextRank Summary:", scores_textrank)



ROUGE Scores for LSA Summary: {'rouge1': Score(precision=0.2702702702702703, recall=0.5, fmeasure=0.3508771929824562), 'rouge2': Score(precision=0.08333333333333333, recall=0.15789473684210525, fmeasure=0.1090909090909091), 'rougeL': Score(precision=0.16216216216216217, recall=0.3, fmeasure=0.2105263157894737)}
ROUGE Scores for TextRank Summary: {'rouge1': Score(precision=0.2702702702702703, recall=0.5, fmeasure=0.3508771929824562), 'rouge2': Score(precision=0.08333333333333333, recall=0.15789473684210525, fmeasure=0.1090909090909091), 'rougeL': Score(precision=0.16216216216216217, recall=0.3, fmeasure=0.2105263157894737)}
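The two sets of scores are identical because LSA and TextRank selected the same two sentences. To show where the ROUGE-1 numbers come from, here is a small hand-rolled version that counts clipped unigram overlap. It is a sketch under my own assumptions (no stemming, simple alphanumeric tokenization, a hypothetical function name `rouge1`) and is not the `rouge_score` implementation. The human summary is treated as the reference, so recall is the share of the human summary's words that the generated summary covers, and precision is the share of the generated summary's words that appear in the human summary.

```python
import re
from collections import Counter

def rouge1(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

human_summary = ("Education fosters critical thinking and creativity, and it plays "
                 "a key role in reducing poverty and building a better society.")
generated_summary = ("By understanding the importance of education, societies can invest "
                     "in future generations to create a better world for everyone. "
                     "Moreover, education helps in building a more equitable society by "
                     "providing opportunities for social mobility and reducing poverty.")

precision, recall, f1 = rouge1(human_summary, generated_summary)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Ten of the 20 words in the human summary also appear in the 37-word generated summary, which gives the 0.5 and 0.27 values seen in the ROUGE-1 scores. Swapping the two arguments swaps precision and recall but leaves the F-measure unchanged.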


