# 📌 Task 2: Exploratory Data Analysis (EDA)
### Internship - CodeAlpha

---

## 🎯 Objective

In this task, we will explore the quotes dataset scraped from Goodreads in Task 1.  

Goals:
- Understand the dataset structure  
- Identify data types, missing values, and patterns  
- Analyze authors, tags, and quote content  
- Prepare insights for visualization and sentiment analysis

---

## 🛠️ Libraries Required

- Pandas
- Matplotlib
- Seaborn
- Counter (from the standard-library `collections` module)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

---

## 📂 Step 1: Load the Scraped Data


In [None]:
# Load the CSV file saved in Task 1
df = pd.read_csv("quotes_data.csv")

---

## 🧠 Step 2: Basic Data Overview


In [None]:
# Shape and info
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
df.info()

Rows: 150, Columns: 3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   quote   150 non-null    object
 1   author  150 non-null    object
 2   tags    139 non-null    object
dtypes: object(3)
memory usage: 3.6+ KB


In [None]:
df.head()

| | quote | author | tags |
|---|---|---|---|
| 0 | “Be yourself; everyone else is already taken.”... | Oscar Wilde | tags:attributed-no-source,be-yourself,gilbert-... |
| 1 | “I'm selfish, impatient and a little insecure.... | Marilyn Monroe | tags:attributed-no-source,best,life,love,misat... |
| 2 | “So many books, so little time.” ― Frank Zappa | Frank Zappa | tags:books,humor |
| 3 | “Two things are infinite: the universe and hum... | Albert Einstein | tags:attributed-no-source,human-nature,humor,i... |
| 4 | “A room without books is like a body without a... | Marcus Tullius Cicero | tags:attributed-no-source,books,simile,soul |


## 🚿 Step 3: Data Cleaning

In [None]:
# Check missing values
df.isnull().sum()

| | 0 |
|---|---|
| quote | 0 |
| author | 0 |
| tags | 11 |


In [None]:
# Select the rows whose tags are missing
nan_tags_df = df[df['tags'].isna()]
nan_tags_df

| | quote | author | tags |
|---|---|---|---|
| 14 | “I've learned that people will forget what you... | Maya Angelou | NaN |
| 16 | “To live is the rarest thing in the world. Mos... | Oscar Wilde | NaN |
| 17 | “A friend is someone who knows all about you a... | Elbert Hubbard | NaN |
| 25 | “Here's to the crazy ones. The misfits. The re... | Steve Jobs | NaN |
| 62 | “Never put off till tomorrow what may be done ... | Mark Twain | NaN |
| 82 | “We don't see things as they are, we see them ... | Anaïs Nin | NaN |
| 92 | “Beauty is in the eye of the beholder and it m... | Jim Henson | NaN |
| 98 | “It is impossible to live without failing at s... | J.K. Rowling | NaN |
| 111 | “The more that you read, the more things you w... | Dr. Seuss, | NaN |
| 120 | “When one door of happiness closes, another op... | Helen Keller | NaN |


In [None]:
# Print each quote that has no tags, prefixed by its row index
for index, row in nan_tags_df.iterrows():
    print(f"{index}. {row['quote']}")

14. “I've learned that people will forget what you said, people will forget what you did, but people will never forget how you made them feel.” ― Maya Angelou
16. “To live is the rarest thing in the world. Most people exist, that is all.” ― Oscar Wilde
17. “A friend is someone who knows all about you and still loves you.” ― Elbert Hubbard
25. “Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.” ― Steve Jobs
62. “Never put off till tomorrow what may be done day after tomorrow just as well.” ― Mark

In [None]:
# Fill the missing tags manually after reading each quote
df.loc[14, 'tags'] = 'tags:feelings,inspirational,kindness,empathy,life'
df.loc[16, 'tags'] = 'tags:life,rare,living,existence,oscar-wilde'
df.loc[17, 'tags'] = 'tags:friendship,love,acceptance,true-friends'
df.loc[25, 'tags'] = 'tags:change,genius,inspirational,individuality,rebellion'
df.loc[62, 'tags'] = 'tags:humor,procrastination,mark-twain,life'
df.loc[82, 'tags'] = 'tags:perception,philosophy,inspirational'
df.loc[92, 'tags'] = 'tags:beauty,humor,perception,truth'
df.loc[98, 'tags'] = 'tags:failure,courage,inspirational,living,jk-rowling'
df.loc[111, 'tags'] = 'tags:books,learning,education,reading,dr-seuss'
df.loc[120, 'tags'] = 'tags:happiness,opportunity,hope,inspirational,helen-keller'
df.loc[142, 'tags'] = 'tags:infinity,philosophy,love,life,john-green'
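
The manual fill above can also be driven from a dictionary, which makes it easier to confirm that only rows with missing tags are touched and that nothing is left unfilled. The following is a minimal sketch on a toy DataFrame; the data, the `manual_tags` mapping, and the variable names are hypothetical and are not taken from `quotes_data.csv`.

```python
import pandas as pd

# Toy stand-in for the quotes DataFrame (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "quote": ["q0", "q1", "q2"],
    "tags": ["tags:books,humor", None, None],
})

# Hypothetical manual tags, keyed by row index
manual_tags = {
    1: "tags:life,hope",
    2: "tags:friendship,love",
}

# Guard: refuse to overwrite rows that already have tags
missing_idx = set(toy.index[toy["tags"].isna()])
unexpected = set(manual_tags) - missing_idx
if unexpected:
    raise ValueError(f"Rows already tagged: {sorted(unexpected)}")

# Apply the manual tags one row at a time
for idx, tags in manual_tags.items():
    toy.loc[idx, "tags"] = tags

# Count how many rows are still missing tags after the fill
remaining_missing = int(toy["tags"].isna().sum())
print(remaining_missing)
```

Checking `remaining_missing` after the fill is a quick way to confirm that every missing row was covered.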

In [None]:
# Remove the literal 'tags:' prefix from the tags column
df['tags'] = df['tags'].str.replace('tags:', '', regex=False)

In [None]:
# Split the comma-separated tags into a list per quote
df['tags'] = df['tags'].str.split(',')
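
If the scraped tags might contain stray spaces or empty entries, the prefix removal and split can be combined with a trimming step. This is only a sketch on made-up strings; whether the real data contains such whitespace is an assumption, and `raw` and `cleaned` are illustrative names.

```python
import pandas as pd

# Hypothetical raw tag strings, including stray spaces around one tag
raw = pd.Series(["tags:books,humor", "tags:life, love ", "tags:soul"])

cleaned = (
    raw.str.replace("tags:", "", regex=False)  # drop the literal prefix
       .str.split(",")                         # one list of tags per row
       .apply(lambda tags: [t.strip() for t in tags if t.strip()])  # trim, drop empties
)
print(cleaned.tolist())
```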

---

## 🔍 Step 4: Explore Key Features

We'll look at:
- Quote length
- Most quoted authors  
- Common tags  
- Duplicate quotes


In [None]:
# Quote length in characters (the quote text includes the trailing author attribution)
df['quote_length'] = df['quote'].str.len()
df['quote_length'].describe()

| | quote_length |
|---|---|
| count | 150.0 |
| mean | 179.54 |
| std | 243.815576 |
| min | 46.0 |
| 25% | 90.25 |
| 50% | 118.0 |
| 75% | 160.75 |
| max | 2483.0 |
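
The maximum of 2,483 characters sits far above the 75th percentile of about 161, which suggests a few very long quotes are pulling the mean up. One common way to flag such values is the 1.5 × IQR rule. The sketch below applies it to a handful of toy lengths that only loosely echo the summary above; the rule itself is a swapped-in convention, not something the original analysis used.

```python
import pandas as pd

# Toy quote lengths (illustrative values, not the real column)
lengths = pd.Series([46, 90, 118, 160, 2483], name="quote_length")

q1 = lengths.quantile(0.25)   # first quartile
q3 = lengths.quantile(0.75)   # third quartile
iqr = q3 - q1                 # interquartile range

# Upper fence under the 1.5 * IQR convention
upper = q3 + 1.5 * iqr

# Lengths above the fence are flagged as unusually long
outliers = lengths[lengths > upper]
print(upper, outliers.tolist())
```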


In [None]:
# Most quoted authors
top_authors = df['author'].value_counts().head(10)
top_authors


| author | count |
|---|---|
| J.K. Rowling, | 8 |
| Albert Einstein | 7 |
| Marilyn Monroe | 5 |
| Mark Twain | 5 |
| Oscar Wilde | 5 |
| John Green, | 5 |
| Bob Marley | 4 |
| J.R.R. Tolkien, | 3 |
| Stephen Chbosky, | 3 |
| Dr. Seuss | 3 |
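
Several author names in the output carry a trailing comma (for example `J.K. Rowling,`), while the same author also appears without it elsewhere in the data (`J.K. Rowling` in row 98). Counts can therefore be split across two spellings. A minimal sketch of one way to normalize the names before counting, using toy values; `authors` and `normalized` are illustrative names:

```python
import pandas as pd

# Toy author column with the kinds of inconsistencies seen above
authors = pd.Series([
    "J.K. Rowling,", "J.K. Rowling", "John Green,", "Mark Twain", " Mark Twain ",
])

# Trim whitespace, drop trailing commas, then trim again
normalized = authors.str.strip().str.rstrip(",").str.strip()

counts = normalized.value_counts()
print(counts)
```

On the real DataFrame the same chain could be assigned back to `df['author']` before calling `value_counts()`.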


In [None]:
# Most common tags: flatten the per-quote tag lists, then count
all_tags = [tag for tags in df['tags'] for tag in tags]
tag_counts = Counter(all_tags)

# Convert to DataFrame
tag_df = pd.DataFrame(tag_counts.items(), columns=['Tag', 'Count']).sort_values(by='Count', ascending=False)
tag_df.head(10)

| | Tag | Count |
|---|---|---|
| 9 | love | 32 |
| 4 | inspirational | 31 |
| 8 | life | 24 |
| 0 | attributed-no-source | 22 |
| 16 | humor | 21 |
| 15 | books | 20 |
| 150 | reading | 11 |
| 44 | friendship | 9 |
| 43 | friends | 6 |
| 19 | philosophy | 6 |
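
The same tag counts can also be obtained without leaving pandas, by exploding the list column into one row per tag. This is an alternative sketch on toy data, not the method used above:

```python
import pandas as pd

# Toy DataFrame with a list-valued tags column (hypothetical values)
toy = pd.DataFrame({"tags": [["love", "life"], ["love", "humor"], ["books"]]})

# explode() yields one row per tag; value_counts() sorts by frequency
tag_counts = toy["tags"].explode().value_counts()
print(tag_counts.head(10))
```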


In [None]:
# Count duplicate quotes (0 means every quote is unique)
df.duplicated(subset='quote').sum()

np.int64(0)

In [None]:
# Save the cleaned data (list-valued tags are written as their string form)
df.to_csv('quotes_data_cleaned.csv', index=False)
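
Because `tags` now holds Python lists, `to_csv` writes each list as its string form (such as `"['books', 'humor']"`), and a plain `read_csv` returns strings rather than lists. A minimal sketch of the round trip, using an in-memory buffer and toy data so no file is written; parsing with `ast.literal_eval` is one possible approach, offered here as an assumption about how later notebooks might reload the file:

```python
import ast
import io

import pandas as pd

# Toy cleaned data with a list-valued tags column (hypothetical values)
toy = pd.DataFrame({
    "quote": ["q0", "q1"],
    "tags": [["books", "humor"], ["life"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read gives the tags back as strings
buffer.seek(0)
as_text = pd.read_csv(buffer)
first_raw = as_text.loc[0, "tags"]

# Parsing each cell with ast.literal_eval restores the lists
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"tags": ast.literal_eval})
print(type(first_raw).__name__, restored.loc[0, "tags"])
```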

## 🔍 Summary

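- Cleaned the `tags` column by manually filling the 11 missing values, removing the `tags:` prefix, and splitting on commas.
- Quote lengths mostly fall between about 90 and 161 characters (the interquartile range); the median is 118 and the longest quote is 2,483 characters.
- Top authors: J.K. Rowling (8 quotes), Albert Einstein (7), then Marilyn Monroe, Mark Twain, Oscar Wilde, and John Green (5 each). Some names carry a trailing comma (e.g. `J.K. Rowling,`), so counts may be split across spellings.
- Most common tags: love (32), inspirational (31), life (24), attributed-no-source (22), humor (21), books (20).
- No duplicate quotes were found.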

Data is ready for [**Visualization**](https://github.com/GhulamMuhammadNabeel/code_alpha_data_analysis/blob/main/CodeAlpha_task3_Visualization.ipynb) and [**Sentiment Analysis**](https://github.com/GhulamMuhammadNabeel/code_alpha_data_analysis/blob/main/CodeAlpha_task4_sentiment_analysis.ipynb).
