# Daily Motivation Quotes


## Business Understanding

In our increasingly fast-paced world, people encounter numerous challenges and responsibilities on a daily basis. To address the need for consistent motivation, we propose a data science project that revolves around curating and delivering carefully selected quotes. These quotes, extracted from diverse sources including historical figures, popular literature, and prominent personalities, will serve as a source of encouragement, reflection, and empowerment for individuals.

#### Objectives:

The primary objectives of this project are as follows:
1.	Curate Inspirational Quotes: Gather a diverse collection of quotes from the Goodreads website, which hosts an extensive compilation of quotes spanning various genres and themes.
2.	Daily Motivational Updates: Develop a system to provide users with daily updates featuring a thoughtfully chosen quote. These updates will cater to different areas of life, ensuring a comprehensive and relatable experience.
3.	Tag-based Grouping: Implement a categorization mechanism that tags each quote based on its thematic content. This grouping will enable users to easily identify quotes that resonate with their specific preferences or current situations.
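
As a rough illustration of objectives 2 and 3, the sketch below picks one quote per calendar day, optionally restricted to a tag. The function name `quote_of_the_day`, the record layout, the sample tags, and the date-based rotation are all assumptions made for illustration; the project description does not specify how the daily quote is chosen.

```python
from datetime import date

# Hypothetical sample records; the real project reads its quotes from Quotes.csv.
QUOTES = [
    {"quote": "So many books, so little time.", "author": "Frank Zappa", "tags": ["books", "humor"]},
    {"quote": "Hope is a waking dream.", "author": "Aristotle", "tags": ["hope"]},
    {"quote": "Be yourself; everyone else is already taken.", "author": "Oscar Wilde", "tags": ["be-yourself"]},
]

def quote_of_the_day(quotes, day, tag=None):
    """Return one quote for `day`, cycling through the (optionally tag-filtered) list."""
    pool = [q for q in quotes if tag is None or tag in q["tags"]]
    if not pool:
        return None
    # The date ordinal advances by one each day, so consecutive days get consecutive quotes.
    return pool[day.toordinal() % len(pool)]
```

Calling `quote_of_the_day(QUOTES, date.today(), tag="books")` would return the same quote all day and move on to the next matching quote tomorrow.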


## Data Understanding

•	Source quotes from the Goodreads website, exploring the wide array of authors and themes available.

•	Analyze the structure of the collected data, including metadata such as author names, publication dates, and associated tags.


In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import scrapy 
#from pathlib import path

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from langdetect import detect
from googletrans import Translator

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from my_functions import translate_to_english, preprocess_text

## Data Preparation
With the data collected, we now load it and begin cleaning it before moving on to analysis.

In [2]:
# Reading the data
quotes = pd.read_csv(r'Quotes.csv', index_col=0)

In [3]:
quotes

Unnamed: 0,Author Name,Quote,Tags
0,Oscar Wilde,“Be yourself; everyone else is already taken.”...,"['attributed-no-source', 'be-yourself', 'gilbe..."
1,Marilyn Monroe,"“I'm selfish, impatient and a little insecure....","['attributed-no-source', 'best', 'life', 'love..."
2,Albert Einstein,“Two things are infinite: the universe and hum...,"['attributed-no-source', 'human-nature', 'humo..."
3,Frank Zappa,"“So many books, so little time.” ― F...","['books', 'humor']"
4,Marcus Tullius Cicero,“A room without books is like a body without a...,"['attributed-no-source', 'books', 'simile', 's..."
...,...,...,...
2995,"A.A. Milne,",“I'm not lost for I know where I am. But howev...,
2996,Henry David Thoreau,“Dreams are the touchstones of our characters....,
2997,"Cassandra Clare,",“Black hair and blue eyes are my favorite comb...,
2998,"Nicholas Sparks,",“In times of grief and sorrow I will hold you ...,


We observe that the Tags column is empty for some quotes, so we need to either fill in the missing tags or remove those rows. 
Removing them would noticeably reduce the number of quotes available to us, so removal is a last resort. 
We will attempt to fill them based on the author. 

In [4]:
# Checking actual number of missing values. 
quotes.isna().sum()

Author Name      0
Quote            0
Tags           502
dtype: int64
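
Assuming the tags are stored in the CSV exactly as printed above, each non-missing entry is a string such as `"['books', 'humor']"` rather than a Python list. The sketch below shows one way to convert such strings into lists while keeping track of the missing rows. The helper name `parse_tags` and the small sample frame are hypothetical.

```python
import ast
import pandas as pd

def parse_tags(value):
    """Turn a string like "['books', 'humor']" into a list; missing values become []."""
    if pd.isna(value):
        return []
    return list(ast.literal_eval(value))

# Small stand-in for the real `quotes` frame
sample = pd.DataFrame({"Tags": ["['books', 'humor']", None, "['life']"]})
sample["tag_list"] = sample["Tags"].apply(parse_tags)

# Same check as quotes.isna().sum(), restricted to the Tags column
missing_count = int(sample["Tags"].isna().sum())
```

Working with real lists makes it easier to count or merge individual tags later than working with the raw strings.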

In [5]:
# checking contents of the quote column. 
quotes['Quote'][3]

'“So many books, so little time.”      ―      Frank Zappa'

It appears the Quote column still contains the author's name, so we split it again below. The quote text is also wrapped in its own curly quotation marks (“ ”). These remain after the author is separated out and will need to be removed so that each quote is not wrapped in a second pair of quotation marks. 

In [6]:
# Split the content at the horizontal bar (―) into 'quote' and 'author' columns;
# n=1 splits only at the first bar, so exactly two columns are produced
quotes[['quote', 'author']] = quotes['Quote'].str.split('―', n=1, expand=True)

# Strip leading and trailing whitespaces from 'quote' and 'author' columns
quotes['quote'] = quotes['quote'].str.strip()
quotes['author'] = quotes['author'].str.strip()

# Drop the original 'Quote' column since we have extracted its contents
quotes.drop('Quote', axis=1, inplace=True)
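
The cell above separates the author from the quote but leaves the curly quotation marks in place. A minimal sketch of removing them is shown here on a small stand-in series; the variable names `raw`, `parts`, and `cleaned` are hypothetical, and the sketch assumes the marks only need removing at the start and end of each quote.

```python
import pandas as pd

raw = pd.Series([
    "“So many books, so little time.”      ―      Frank Zappa",
    "“Hope is a waking dream.”  ―  Aristotle",
])

# Split once on the horizontal bar (U+2015) so the result always has two columns
parts = raw.str.split("―", n=1, expand=True)

cleaned = pd.DataFrame({
    # Remove surrounding whitespace first, then curly or straight double quotes at either end
    "quote": parts[0].str.strip().str.strip("“”\""),
    "author": parts[1].str.strip(),
})
```

Because `str.strip` only touches the ends of the string, quotation marks inside the quote text (for example around dialogue) are left alone.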

In [7]:
nan = quotes[quotes['Tags'].isna()]

In [8]:
nan

Unnamed: 0,Author Name,Tags,quote,author
2498,"John Green,",,“And then something invisible snapped insider ...,"John Green, Looking for Alaska"
2499,Aristotle,,“Hope is a waking dream.”,Aristotle
2500,Annie Proulx,,“You should write because you love the shape o...,Annie Proulx
2501,Bill Watterson,,“I'm killing time while I wait for life to sho...,Bill Watterson
2502,Alex Haley,,"“Either you deal with what is the reality, or ...",Alex Haley
...,...,...,...,...
2995,"A.A. Milne,",,“I'm not lost for I know where I am. But howev...,"A.A. Milne, Winnie-the-Pooh"
2996,Henry David Thoreau,,“Dreams are the touchstones of our characters.”,Henry David Thoreau
2997,"Cassandra Clare,",,“Black hair and blue eyes are my favorite comb...,"Cassandra Clare, Clockwork Angel"
2998,"Nicholas Sparks,",,“In times of grief and sorrow I will hold you ...,"Nicholas Sparks, The Notebook"


In [9]:
# preview the changes
quotes['quote'][53]

"“If you don't stand for something you will fall for anything.”"

In [10]:
quotes.groupby('Author Name').sum()

Unnamed: 0_level_0,Tags,quote,author
Author Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"A. A. Milne,",['antolini'],“It is more fun to talk with someone who doesn...,"A. A. Milne, Winnie-the-Pooh"
A.A. Milne,"['happiness', 'hope', 'inspirational', 'new-ye...","“Weeds are flowers, too, once you get to know ...",A.A. MilneA.A. MilneA.A. MilneA.A. MilneA.A. M...
"A.A. Milne,",['live-death-love']['activism']['dave-matthews...,"“Piglet sidled up to Pooh from behind. ""Pooh!""...","A.A. Milne, The House at Pooh CornerA.A. Miln..."
A.J. Cronin,['writing'],"“Worry never robs tomorrow of its sorrow, but ...",A.J. Cronin
Abigail Van Buren,['life'],“The best index to a person's character is how...,Abigail Van Buren
...,...,...,...
جلال الدين الرومي,0,“لا تجزع من جرحك، وإلا فكيف للنور أن يتسلل إلى...,جلال الدين الرومي
عباس محمود العقاد,0,“ليس هناك كتابا أقرأه و لا أستفيد منه شيئا جدي...,عباس محمود العقاد
غسان كنفاني,['identity'],“!لك شيء في هذا العالم.. فقم”,غسان كنفاني
محمود درويش,"['disappointment', 'dorian-gray', 'marriage', ...",“و كن من أنتَ حيث تكون و احمل عبءَ قلبِكَ وحدهُ”,محمود درويش
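
The grouped output lists "A.A. Milne", "A.A. Milne," and "A. A. Milne," as separate authors, and calling `sum()` on text columns concatenates the strings instead of counting anything. The sketch below shows one possible way to normalize the names and count quotes per author. It assumes the trailing commas are scraping residue and that spacing between initials is not meaningful; the column name `author_clean` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "Author Name": ["A.A. Milne", "A.A. Milne,", "A. A. Milne,", "Aristotle"],
})

sample["author_clean"] = (
    sample["Author Name"]
    .str.strip()
    .str.rstrip(",")                                     # drop the trailing comma
    .str.replace(r"\.\s+(?=[A-Z]\.)", ".", regex=True)   # "A. A." -> "A.A."
)

# Count quotes per author instead of concatenating strings with sum()
counts = sample["author_clean"].value_counts()
```

On this sample, all three spellings of A.A. Milne collapse into a single entry.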


Checking the distribution of the authors and their quotes, we see that some quotes are not in English, which would affect the outcome of our code when filling the NaN tags. We therefore need to translate them to English before preprocessing them for the fill. 
We can do that using the langdetect package together with the googletrans package. 
We install both with pip, restart the kernel, and import them with the other packages.
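
The `translate_to_english` helper is imported from `my_functions` and is not shown in this notebook. One possible shape for such a helper is sketched below: detect the language, translate only when it is not English, and fall back to the original text if anything fails. The name `to_english` and the two callables it receives are hypothetical; in practice `detect` from langdetect could serve as the detector and a small wrapper around a googletrans `Translator` as the translator.

```python
def to_english(text, detect_language, translate):
    """Return `text` in English, translating only when another language is detected.

    detect_language: callable returning a language code such as "en" or "ar"
    translate: callable taking (text, source_language) and returning English text
    Both callables are placeholders for whichever libraries are used.
    """
    if not isinstance(text, str) or not text.strip():
        return text
    try:
        language = detect_language(text)
        return text if language == "en" else translate(text, language)
    except Exception:
        # Detection and translation can fail (very short text, network errors);
        # keep the original rather than lose the row.
        return text


# Stand-in callables so the sketch runs without network access
def fake_detect(text):
    return "en" if text.isascii() else "ar"

def fake_translate(text, source_language):
    return f"[translated from {source_language}]"
```

Passing the detector and translator in as arguments keeps the logic testable offline and makes it easy to swap translation services.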

In [11]:
# Apply the translation function from my function file to the quote column
quotes['quote_2'] = quotes['quote'].apply(translate_to_english)


In [12]:
# Apply the translation function from my function file to the author column
quotes['author_2'] = quotes['author'].apply(translate_to_english)



We now begin filling the NaN tags in the dataset. We first preprocess the columns before computing similarity. Preprocessing the quotes before building the TF-IDF matrix puts the text into a consistent form, which reduces noise and improves the accuracy of the similarity measurements.
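
The `preprocess_text` helper also lives in `my_functions` and is not shown here. A minimal sketch of typical preprocessing steps (lowercasing, removing punctuation, dropping stop words) is given below. The name `simple_preprocess` is hypothetical, and the tiny stop-word set is only a stand-in for a full list such as NLTK's `stopwords.words('english')`.

```python
import re

# Tiny illustrative stop-word set; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "and", "are", "for", "i", "is", "of", "so", "the", "to", "you"}

def simple_preprocess(text):
    """Lowercase, keep only ASCII letters and apostrophes, and drop stop words."""
    text = text.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS and t != "'")
```

Because only ASCII letters are kept, this sketch assumes the text has already been translated to English.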

In [11]:
# Step 1: Preprocess Quotes
quotes['quote_1'] = quotes['quote_2'].fillna('').apply(preprocess_text)
quotes['author_1'] = quotes['author_2'].fillna('').apply(preprocess_text)

In [14]:
# Step 2: Vectorize Quotes
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(quotes['quote_1'].fillna(''))
cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Step 3: Define a Threshold (You can adjust this threshold)
similarity_threshold = 0.1

# Step 4: Fill NaN Values
# Snapshot the original tags so that tags filled in during the loop
# are not reused as evidence for later rows.
tags = quotes['Tags'].to_numpy(copy=True)
has_tags = quotes['Tags'].notna().to_numpy()
tags_col = quotes.columns.get_loc('Tags')

# Iterate by row position, because the similarity matrix is indexed by position
for pos in np.flatnonzero(~has_tags):
    similar = has_tags & (cosine_similarities[pos] > similarity_threshold)
    if similar.any():
        # Collect the unique tag strings of the similar quotes using a set
        all_tags = sorted(set(tags[similar]))
        quotes.iloc[pos, tags_col] = ', '.join(all_tags)
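
To make the effect of the similarity threshold easier to see, here is a toy version of the same idea on four short texts. It uses `TfidfVectorizer` and `cosine_similarity` as in the cell above; the sample data, the comma-separated tag format, and the helper name `fill_tags` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy = pd.DataFrame({
    "quote_1": ["books little time", "room without books", "hope waking dream", "many books read"],
    "Tags": ["books, humor", "books", "hope", None],
})

def fill_tags(frame, threshold):
    """Copy tags from tagged rows whose TF-IDF cosine similarity exceeds `threshold`."""
    matrix = TfidfVectorizer().fit_transform(frame["quote_1"])
    sims = cosine_similarity(matrix)
    tagged = frame["Tags"].notna().to_numpy()
    filled = frame["Tags"].copy()
    for pos in (~tagged).nonzero()[0]:
        similar = tagged & (sims[pos] > threshold)
        if similar.any():
            # Split each tag string so the union holds individual tags
            tags = {t.strip() for s in frame["Tags"][similar] for t in s.split(",")}
            filled.iloc[pos] = ", ".join(sorted(tags))
    return filled

filled = fill_tags(toy, threshold=0.1)
```

With a threshold of 0.1 the untagged row inherits tags from both quotes that share the word "books"; raising the threshold to 0.5 leaves it unfilled, which illustrates how a low threshold trades precision for coverage.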

In [13]:
list(quotes.Tags)

["'life'",
 "'love'",
 "'humor'",
 "'love'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'humor'",
 "'inspirational'",
 "'inspirational'",
 "'love'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'humor'",
 "'inspirational'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'life'",
 "'love'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'humor'",
 "'humor'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'love'",
 "'love'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'life'",
 "'inspirational'",
 "'inspirational'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'life'",
 "'l