# Daily Motivation Quotes


## Business Understanding

In our increasingly fast-paced world, people encounter numerous challenges and responsibilities on a daily basis. To address the need for consistent motivation, we propose a data science project that revolves around curating and delivering carefully selected quotes. These quotes, extracted from diverse sources including historical figures, popular literature, and prominent personalities, will serve as a source of encouragement, reflection, and empowerment for individuals.

#### Objectives:

The primary objectives of this project are as follows:
1. Curate Inspirational Quotes: Gather a diverse collection of quotes from the Goodreads website, which hosts an extensive compilation of quotes spanning various genres and themes.
2. Daily Motivational Updates: Develop a system that provides users with a daily update featuring a thoughtfully chosen quote. These updates will cover different areas of life so that the experience stays comprehensive and relatable.
3. Tag-based Grouping: Implement a categorization mechanism that tags each quote based on its thematic content. This grouping will let users easily find quotes that match their preferences or current situation.
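
Objectives 2 and 3 can be sketched together as a small selection function. This is only an illustration: the sample records, the `quote_of_the_day` name, and the choice of hashing the date to pick an index are all assumptions, not part of the project's code.

```python
import hashlib
from datetime import date

# Hypothetical placeholder records; the real project would use the scraped data.
QUOTES = [
    {"quote": "Stay curious.", "author": "Author A", "tags": ["learning", "life"]},
    {"quote": "Begin anyway.", "author": "Author B", "tags": ["courage"]},
    {"quote": "Rest is productive.", "author": "Author C", "tags": ["life"]},
]


def quote_of_the_day(quotes, day, tag=None):
    """Return one quote for `day`, optionally restricted to quotes carrying `tag`.

    The same (day, tag) pair always maps to the same quote, so every user
    sees the same daily update without any stored state.
    """
    pool = [q for q in quotes if tag is None or tag in q["tags"]]
    if not pool:
        raise ValueError(f"no quotes tagged {tag!r}")
    # Hash the date and tag so the pick is stable across runs and machines.
    key = f"{day.isoformat()}|{tag or ''}".encode("utf-8")
    index = int(hashlib.sha256(key).hexdigest(), 16) % len(pool)
    return pool[index]


todays_quote = quote_of_the_day(QUOTES, date.today(), tag="life")
print(f'{todays_quote["quote"]} ({todays_quote["author"]})')
```

Hashing the date rather than calling `random.choice` keeps the daily pick reproducible, which makes the behaviour easier to test.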


## Data Understanding

•	Source quotes from the Goodreads website, exploring the wide array of authors and themes available.

•	Analyze the structure of the collected data, including metadata such as author names, publication dates, and associated tags.
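
A quick way to analyze the structure is to tabulate each column's type, missing count, and number of distinct values. The helper below is a minimal sketch: `summarize_columns` is a hypothetical name, and the sample rows are placeholders that only borrow the column names used later in this notebook.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # nunique() ignores missing values by default
    })


# Placeholder rows for illustration only.
sample = pd.DataFrame({
    "Quote": [
        "“Stay curious.” ― Author A",
        "“Begin anyway.” ― Author B",
        "“Rest is productive.” ― Author A",
    ],
    "Author Name": ["Author A", "Author B", "Author A"],
    "Tags": ["learning", None, "life"],
})

summary = summarize_columns(sample)
print(summary)
```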


In [None]:
# Data collection
import requests
import scrapy
from bs4 import BeautifulSoup

# Data handling and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud

# Language detection and translation
from langdetect import detect
from googletrans import Translator

# Text processing and modelling
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Project helpers
from my_functions import translate_to_english, preprocess_text

: 

## Data Preparation
After obtaining the data we intend to use, we will now load it here and begin the data cleaning process before proceeding to analysis.

In [None]:
# Reading the data
quotes = pd.read_csv(r'Quotes.csv', index_col=0)

: 

In [None]:
quotes

: 

We observe that the Tags column is empty for some quotes, so we will need to either fill those values in or, if that is not possible, remove the affected rows. 
Removing them would noticeably reduce the number of quotes available to us, so it will be a last resort. 
We will first attempt to fill the missing tags based on the author. 
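
One way to fill tags based on the author is to give each untagged quote that author's most frequent tag and leave it empty when the author has no tagged quotes at all. The sketch below assumes each quote carries a single tag string; `fill_tags_by_author` is a hypothetical helper and the sample rows are placeholders.

```python
import pandas as pd


def fill_tags_by_author(df, author_col="Author Name", tag_col="Tags"):
    """Fill missing tags with each author's most frequent tag, where one exists."""
    # Only rows with a known tag can tell us an author's usual tag.
    known = df.dropna(subset=[tag_col])
    author_mode = known.groupby(author_col)[tag_col].agg(
        lambda tags: tags.mode().iloc[0]  # ties resolve to the first sorted value
    )
    filled = df.copy()
    # Authors absent from author_mode map to NaN, so their rows stay missing.
    filled[tag_col] = filled[tag_col].fillna(filled[author_col].map(author_mode))
    return filled


# Placeholder rows for illustration only.
sample = pd.DataFrame({
    "Author Name": ["Author A", "Author A", "Author A", "Author B", "Author C"],
    "Tags": ["life", "life", None, None, "courage"],
})

filled = fill_tags_by_author(sample)
print(filled)
```

Rows that are still missing afterwards (here, the quote by "Author B") are the ones that would need a different strategy or removal.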

In [None]:
# Checking actual number of missing values. 
quotes.isna().sum()

: 

In [None]:
# checking contents of the quote column. 
quotes['Quote'][3]

: 

It appears the Quote column still contains the name of the author, so we redo the split below. We also observe that the quotes are wrapped in extra quotation marks, which will remain after the name is separated from the quote. These need to be removed as well, leaving a single pair of double quotation marks around each quote. 
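
The cleanup described above can be sketched for a single string as follows. The function name `split_quote_and_author` is hypothetical, and the sketch assumes the author attribution follows the last horizontal bar (―) in the text.

```python
def split_quote_and_author(raw):
    """Split 'quote ― author' at the last horizontal bar and normalise quote marks."""
    if "―" in raw:
        text, _, author = raw.rpartition("―")
    else:
        text, author = raw, ""
    # Drop surrounding whitespace and any run of straight or curly double quotes.
    text = text.strip().strip('“”"').strip()
    author = author.strip()
    # Re-wrap the text in exactly one pair of straight double quotes.
    return f'"{text}"', (author or None)


print(split_quote_and_author("““Stay curious.”” ― Author A"))
```

The same function could be applied row by row with `Series.apply` if the vectorised string methods turn out to be too limiting.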

In [None]:
# Split the content at the horizontal bar (―) into 'quote' and 'author' columns.
# rsplit with n=1 splits only at the last bar, so a bar inside the quote text
# cannot produce more than two columns.
quotes[['quote', 'author']] = quotes['Quote'].str.rsplit('―', n=1, expand=True)

# Strip leading and trailing whitespace from the 'quote' and 'author' columns
quotes['quote'] = quotes['quote'].str.strip()
quotes['author'] = quotes['author'].str.strip()

# Drop the original 'Quote' column since we have extracted its contents
quotes.drop('Quote', axis=1, inplace=True)

: 

In [None]:
nan = quotes[quotes['Tags'].isna()]

: 

In [None]:
nan

: 

In [None]:
# preview the changes
quotes['quote'][53]

: 

In [None]:
quotes.groupby('Author Name').sum()

: 

Checking the distribution of the authors and their quotes, we realize that some of the quotes are not in English, which would affect the outcome of our code when filling the NaN tags. We therefore need to translate them to English before preprocessing them for the fill. 
We can do that using the langdetect package together with the googletrans package. 
We will install them using pip, then restart the kernel and import them with the other packages.

Next, we will write a function called translate_to_english to do the translation for us; it is available in our my_functions file. 
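
The actual `translate_to_english` lives in the my_functions file and is not shown here. As a rough illustration of the detect-then-translate logic, the sketch below takes the detector and translator as arguments; `translate_if_needed` and the two stand-in functions are hypothetical and exist only so the example runs without network access.

```python
def translate_if_needed(text, detect_language, translate, target="en"):
    """Return `text` in the target language, translating only when needed.

    `detect_language(text)` must return a language code such as 'en', and
    `translate(text, target)` must return the translated string.
    """
    if not isinstance(text, str) or not text.strip():
        return text  # leave missing or empty values untouched
    try:
        if detect_language(text) == target:
            return text
        return translate(text, target)
    except Exception:
        # Detection or translation failed; keep the original text
        # rather than losing the row.
        return text


# Stand-ins for a real detector and translator (assumptions for this demo).
FAKE_DICTIONARY = {"Bonjour": "Hello"}


def fake_detect(text):
    return "fr" if text in FAKE_DICTIONARY else "en"


def fake_translate(text, target):
    return FAKE_DICTIONARY[text]


print(translate_if_needed("Bonjour", fake_detect, fake_translate))
```

In the notebook, `langdetect.detect` could be passed as `detect_language`, with a thin wrapper around the googletrans translator as `translate`.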

In [None]:
# Apply the translation function from my function file to the quote column
#quotes['quote_2'] = quotes['quote'].apply(translate_to_english)


: 

Below we will also apply the translate_to_english function to the author column to have the names in English. These cells took a long time to run on my local machine, so I opted to run them on a cloud service (Google Colab) and saved the resulting dataframe to a new file, which we read below. 
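
Part of the running time may come from translating the same value repeatedly, since an author's name appears once per quote. A possible speed-up, sketched below under that assumption, is to call the slow function once per distinct value and map the results back. `apply_to_unique` is a hypothetical helper and `slow_upper` is a stand-in for the translation call.

```python
import pandas as pd


def apply_to_unique(series, func):
    """Apply `func` once per distinct non-null value and map the results back."""
    lookup = {value: func(value) for value in series.dropna().unique()}
    # Missing values are not in the lookup, so they stay missing.
    return series.map(lookup)


calls = []


def slow_upper(value):
    """Stand-in for an expensive call; records how often it runs."""
    calls.append(value)
    return value.upper()


authors = pd.Series(["Author A", "Author B", "Author A", None, "Author A"])
result = apply_to_unique(authors, slow_upper)
print(result.tolist(), "calls made:", len(calls))
```

This helps most on the author column; quote texts are largely unique, so the saving there would be small.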

In [None]:
# Apply the translation function from my function file to the author column
# quotes['author_2'] = quotes['Author Name'].apply(translate_to_english)


: 

In [None]:
quotes.groupby('Author Name').sum()

: 

In [None]:
# Reading the new file 
quotes_2 = pd.read_csv(r'E:\Documents\data_science\post_capstone\Everyday_Quotes\Everyday_Quotes\quotes_2.csv', index_col=0)

: 

In [None]:
quotes_2.head()

: 

Below we will perform some data wrangling to ensure we have comprehensive data to work with. 

In [None]:
# Drop duplicate rows across all columns
quotes_2 = quotes_2.drop_duplicates()

: 

In [None]:
# Fill missing 'Tags' values with the most common tag in the column
quotes_2 = quotes_2.fillna({'Tags': quotes_2['Tags'].mode()[0]})

: 

In [None]:
quotes_2.groupby('Author Name').sum()

: 

#### Visualizations

In [None]:
# Word Cloud
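# Drop missing quotes and cast to str so that join() does not fail on NaN values
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(quotes_2['quote_2'].dropna().astype(str)))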
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

: 

In [None]:
# Author Contribution Bar Chart
author_counts = quotes_2['author_2'].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=author_counts.values, y=author_counts.index, hue=author_counts.index, palette='viridis', dodge=False)
plt.title('Top 10 Authors Contribution')
plt.xlabel('Number of Quotes')
plt.legend(title='Authors', loc='lower right')
plt.show()

: 

In [None]:
# Tag Distribution Pie Chart
tag_counts = quotes_2['Tags'].value_counts()
# Pass the counts as `values`; without them every tag would get an equal slice
fig = px.pie(names=tag_counts.index, values=tag_counts.values, title='Tag Distribution')
fig.update_traces(textinfo='percent+label')
fig.show()

: 

In [None]:
# Quote Length Distribution
# .str.len() returns NaN for missing quotes instead of raising a TypeError
quote_lengths = quotes_2['quote_2'].str.len()
plt.figure(figsize=(10, 5))
sns.histplot(quote_lengths, bins=30, kde=True)
plt.title('Quote Length Distribution')
plt.xlabel('Quote Length')
plt.ylabel('Frequency')
plt.show()

: 

In [44]:
# Author vs. Tag Matrix
author_tag_matrix = pd.crosstab(quotes_2['author_2'], quotes_2['Tags'])
plt.figure(figsize=(12, 8))
sns.heatmap(author_tag_matrix, cmap='Blues', cbar_kws={'label': 'Number of Quotes'}, annot=True, fmt='g')
plt.title('Author vs. Tag Matrix')
plt.xlabel('Tags')
plt.ylabel('Authors')
plt.show()

: 

In [43]:
# Scatter plot of authors against tags (a simple stand-in for a network graph)
network_graph = px.scatter(quotes_2, x='author_2', y='Tags', title='Network Graph of Authors and Tags', 
                            labels={'author_2': 'Author', 'Tags': 'Tags'})
network_graph.show()