<a href="https://colab.research.google.com/github/Arvinzaheri/data_colab_task/blob/main/Data_analys.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# First of all, we load the data from Google Drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
path = "/content/drive/MyDrive/Colab Notebooks/booksummaries.txt"

# Data Analysis and Cleaning

## Problem Identification
Upon initial inspection of the dataset, we identified several issues that could hinder our data analysis process. The data was in a text file with inconsistent separators, sometimes a space, sometimes a tab, and other times different characters. Additionally, the dataset did not have column names, making it difficult to understand the context of the data. There were also various symbols and additional characters like `:::::` instead of a single `:`.

## Solution
To address these issues, we decided to use Python's `pandas` library to convert the text file into a DataFrame, a two-dimensional tabular data structure with labeled axes (rows and columns). We also used Python's `re` (regular expression) module to clean the data.

### Data Loading
We read the file line by line, splitting each line into columns on the separators (a tab or two consecutive spaces). This gave us a list of columns for each line, which we appended to our data list.

### Data Cleaning
We created a `clean_data` function that uses regular expressions to remove unwanted characters from the text. Specifically, we wrote a regex pattern to match any character that is not a letter, a number, or a space, and replaced these characters with nothing (effectively removing them).

```python
def clean_data(text):
    # Keep only ASCII letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
```


### Separator Inconsistency
One of the challenges we faced was the inconsistency in the separators used in the data. In some instances, a space was used, in others, a tab, and sometimes different characters altogether. This inconsistency made it difficult to correctly split the data into separate fields.

### Handling Separator Inconsistency
To handle this, we used Python's `re` (regular expression) module, which provides flexible pattern-matching capabilities. We used the `re.split()` function to split each line into columns. This function accepts several separators when they are joined with the alternation operator `|` in the regex pattern. In our case, we used `\t|  `, which matches either a tab (`\t`) or two consecutive spaces.

Here's the relevant part of the code:

```python
# Split the line into columns
columns = re.split(r'\t|  ', line)
```



In [None]:
import pandas as pd
import re

# Define a function to clean the data
def clean_data(text):
    # Keep only ASCII letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text


# Read the file
with open(path, 'r') as file:
    lines = file.readlines()

# Split the lines into columns and clean the data
data = []
for line in lines:
    # Split the line into columns
    columns = re.split(r'\t|  ', line)
    # Clean the data
    columns = [clean_data(column) for column in columns]
    data.append(columns)

# Convert the data into a DataFrame
df = pd.DataFrame(data)

# Assign column names (this assumes the widest row was split into exactly eight fields)
df.columns = ['Id', 'Column2', 'Booktitle', 'Author', 'Year', 'Genre', "Summary", 'Summary_Usless']
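
To make the splitting and cleaning steps concrete, here is a minimal sketch on a single made-up line. The sample text is hypothetical and not copied from the dataset, and stripping the trailing newline before splitting is an extra step that the cell above does not perform.

```python
import re

def clean_data(text):
    # Same pattern as in the notebook: drop everything except
    # ASCII letters, digits, and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Hypothetical sample line mixing a tab and a double space as separators
sample_line = "620\t/m/0hhy  Animal Farm\tGeorge Orwell\n"

# Strip the trailing newline first so it does not end up in the last field
columns = re.split(r'\t|  ', sample_line.rstrip('\n'))
cleaned = [clean_data(column) for column in columns]

print(columns)  # raw fields after splitting
print(cleaned)  # fields after removing punctuation
```

The second field shows the effect of the cleaning step: the slashes in `/m/0hhy` are removed, which is why the identifiers in the DataFrame look like `m0hhy`.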


In [None]:
df.head()

|   | Id | Column2 | Booktitle | Author | Year | Genre | Summary | Summary_Usless |
|---|---|---|---|---|---|---|---|---|
| 0 | 620 | m0hhy | Animal Farm | George Orwell | 19450817.0 | m016lj8 Roman u00e0 clef m06nbt Satire m0dwly ... | Old Major the old boar on the Manor Farm call... | |
| 1 | 843 | m0k36 | A Clockwork Orange | Anthony Burgess | 1962.0 | m06n90 Science Fiction m0l67h Novella m014dfn ... | Alex a teenager living in nearfuture England ... | |
| 2 | 986 | m0ldx | The Plague | Albert Camus | 1947.0 | m02m4t Existentialism m02xlf Fiction m0pym5 Ab... | The text of The Plague is divided into five p... | |
| 3 | 1756 | m0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... | |
| 4 | 2080 | m0wkt | A Fire Upon the Deep | Vernor Vinge | | m03lrw Hard science fiction m06n90 Science Fic... | The novel posits that space around the Milky ... | |


### Genre Cleaning
The genre column in our data contained encoded identifiers along with the actual genre names. These identifiers were not meaningful for our analysis and made the data harder to understand.

### Handling Genre Cleaning
To clean the genre column, we created a `clean_genre` function that uses a regular expression to remove these identifiers and keep only the genre names. The pattern deletes every word that contains at least one digit, which covers the encoded identifiers (and leftover escape fragments such as `u00e0`) while leaving the genre words untouched.

Here's the `clean_genre` function:

```python
def clean_genre(text):
    # Remove words that contain numbers
    text = re.sub(r'\b\w*\d\w*\b', '', text)
    return text
```


In [None]:
def clean_genre(text):
    # Remove words that contain numbers
    text = re.sub(r'\b\w*\d\w*\b', '', text)
    return text


df['Genre'] = df['Genre'].apply(clean_genre)
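
As a quick illustration, the sketch below applies the same pattern to a shortened version of the first row's genre string. Removing the identifiers leaves extra spaces behind, so the example also shows one optional way to tidy them. The `normalize_spaces` helper is hypothetical and not part of the notebook.

```python
import re

def clean_genre(text):
    # Same pattern as in the notebook: remove words that contain a digit
    return re.sub(r'\b\w*\d\w*\b', '', text)

def normalize_spaces(text):
    # Hypothetical helper: collapse runs of whitespace and trim both ends
    return ' '.join(text.split())

# Shortened from the Genre value of the first row shown earlier
raw = "m016lj8 Roman u00e0 clef m06nbt Satire"

stripped = clean_genre(raw)
print(repr(stripped))              # identifiers removed, gaps remain
print(normalize_spaces(stripped))  # single spaces between genre words
```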


In [None]:
import numpy as np

# Replace empty strings with np.nan
df.replace("", np.nan, inplace=True)

In [None]:
# Count the missing (NaN) values in each column
df.isna().sum()

Booktitle       0
Author       2382
Genre        3719
Summary         0
dtype: int64

In [None]:
# Drop the columns we do not need: Id, Column2, Year, and the trailing Summary_Usless column
df.drop(columns=['Id', 'Column2', 'Year', 'Summary_Usless'],inplace=True)

In [None]:
df.head()

|   | Booktitle | Author | Genre | Summary |
|---|---|---|---|---|
| 0 | Animal Farm | George Orwell | Roman clef Satire Childrens literature Sp... | Old Major the old boar on the Manor Farm call... |
| 1 | A Clockwork Orange | Anthony Burgess | Science Fiction Novella Speculative fiction... | Alex a teenager living in nearfuture England ... |
| 2 | The Plague | Albert Camus | Existentialism Fiction Absurdist fiction N... | The text of The Plague is divided into five p... |
| 3 | An Enquiry Concerning Human Understanding | David Hume | | The argument of the Enquiry proceeds by a ser... |
| 4 | A Fire Upon the Deep | Vernor Vinge | Hard science fiction Science Fiction Specul... | The novel posits that space around the Milky ... |


### Handling Missing Values in Genre

The genre column in our data contained some missing values, which started out as empty strings (`""`) and were converted to `NaN` above. Missing values can hinder our data analysis process, so it's important to handle them appropriately.

## Filling Missing Values

There are several strategies we can use to fill these missing values:

1. **Group by Author**: One approach is to group the data by the author, and fill the missing genre values based on the most common genre for each author. This assumes that an author typically writes in the same genre.

```python
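# Find each author's most common genre (groups with no known genre stay NaN)
filled = df.groupby('Author')['Genre'].transform(lambda x: x.fillna(x.mode().iloc[0]) if not x.mode().empty else x)
# Only fill the gaps, so rows without an author keep whatever genre they already have
df['Genre'] = df['Genre'].fillna(filled)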
```

2. **Use NLP Models**: If many values are still missing, we can use a Natural Language Processing (NLP) model, such as a language model (LM), to predict the genre from other information, such as the book summary.

3. **Manual Filling**: If the number of missing values is small, we could also manually determine the missing values by researching the genre of the specific books.
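
The group-by-author strategy can be sketched on a small toy table. The data and the names `toy` and `fill_with_mode` are made up for illustration. The sketch only fills gaps, so an author with no known genre stays missing, and a row with no author keeps its existing genre.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: author "A" has a known genre, author "B" has none,
# and the last row has a genre but no author
toy = pd.DataFrame({
    "Author": ["A", "A", "A", "B", np.nan],
    "Genre": ["Fantasy", "Fantasy", np.nan, np.nan, "Satire"],
})

def fill_with_mode(genres):
    # Fill gaps with the most common genre of this author, if there is one
    modes = genres.mode()
    if modes.empty:
        return genres
    return genres.fillna(modes.iloc[0])

filled = toy.groupby("Author")["Genre"].transform(fill_with_mode)

# Start from the original column and only fill the missing entries
toy["Genre"] = toy["Genre"].fillna(filled)
print(toy)
```

Filling from the original column, rather than overwriting it with the grouped result, is a deliberate choice here: rows whose author is missing do not belong to any group, and this keeps their existing genre intact.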


In [None]:
df.groupby('Author')['Genre'].value_counts()

Author         Genre                           
                                                   1596
                Science Fiction                     106
                Childrens literature                 74
                Novel                                61
                Fantasy                              48
                                                   ... 
mile Zola                                             1
                Short story                           1
                Psychological novel                   1
                Psychology  Psychological novel       1
sne Seierstad   Nonfiction                            1
Name: count, Length: 9638, dtype: int64
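
Names such as `mile Zola` and `sne Seierstad` in this output have lost their first letter (presumably `É` and `Å`), because the `[^a-zA-Z0-9\s]` pattern in `clean_data` also removes accented letters. A minimal sketch of a Unicode-aware alternative is shown below. The function name `clean_data_unicode` is hypothetical, and this variant was not used to produce the results in this notebook.

```python
import re

def clean_data_unicode(text):
    # In Python 3, \w matches Unicode letters and digits (and the underscore),
    # so accented characters survive while punctuation is removed
    return re.sub(r'[^\w\s]', '', text)

print(clean_data_unicode("Émile Zola:::::"))
print(clean_data_unicode("Åsne Seierstad"))
```

Note that this pattern keeps underscores, which the original pattern removes.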

# Genre Searching Solution

After trying some simpler solutions, I settled on this one.
The `Booktitle` column has no missing data, so we can use the title as a keyword to search for the genre.


In [108]:
import requests
from bs4 import BeautifulSoup

def get_book_genre(book_name):
    # Replace spaces with '+' for the URL
    book_name = book_name.replace(' ', '+')

    # Use Google to search for the book
    url = f"https://www.google.com/search?q={book_name}+book+genre"

    # Send a request to the website
    response = requests.get(url)

    # Parse the HTML content of the page with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the genre of the book
    genre_list = soup.find_all('div', {'class': 'BNeawe s3v9rd AP7Wnd'})

    # Extract the text from each genre element and store them in a list
    genres = [genre.get_text() for genre in genre_list if "Genre" in genre.get_text() or "Subject" in genre.get_text()]

    try:
        genres = genres[0]

        # The raw text looks like "Genres: Allegory, Fable, Satire, and more";
        # keep at most the first three space-separated entries
        # (trailing commas are left in place)
        _, genres = genres.split(": ")
        genres = np.array(genres.split(" "))
        if len(genres) > 3:
            genres = genres[:3]
        return genres
    except (IndexError, ValueError):
        # No matching element, or the text did not have the "Label: values" shape
        genres = [genre.get_text() for genre in genre_list]
        print(genres)
        return np.nan


book_name = "Animal Farm"  # Example book
print(get_book_genre(book_name))
# The raw result reads "Genres: Allegory, Fable, Satire, and more"; the function keeps the first three entries



['Allegory,' 'Fable,' 'Satire,']
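
The printed result still carries trailing commas because the string is split on spaces. A possible way to parse the same text more cleanly is sketched below. The helper `parse_genres` is hypothetical, and it assumes the text always has the `Label: a, b, c, and more` shape seen in this one example.

```python
def parse_genres(raw, limit=3):
    # Keep only the part after the "Genres: " label
    _, _, listing = raw.partition(": ")
    # Split on commas and trim the surrounding spaces
    parts = [part.strip() for part in listing.split(",")]
    # Drop empty pieces and the trailing "and more" filler
    parts = [part for part in parts if part and not part.startswith("and ")]
    return parts[:limit]

print(parse_genres("Genres: Allegory, Fable, Satire, and more"))
```

Splitting on commas instead of spaces also keeps multi-word genres such as "Science Fiction" together.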


In [179]:
import requests

def get_book_genre(book_name):
    # Replace spaces with '+' for the URL
    book_name = book_name.replace(' ', '+')

    # Use the Google Books API to search for the book
    url = f"https://www.googleapis.com/books/v1/volumes?q={book_name}"

    # Send a request to the API
    response = requests.get(url)

    # Convert the response to JSON
    data = response.json()

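    # Get the first book from the results
    try:
        book = data['items'][0]
    except (KeyError, IndexError):
        # No results: print the query for inspection and leave the genre missing
        print(book_name)
        return np.nan

    # Get the genres (categories) of the book
    genres = book['volumeInfo'].get('categories', [])

    try:
        return np.array(genres[0])
    except IndexError:
        # The first result has no categories, so look through later results.
        # NOTE: the search starts at index 2 and has no upper bound, so it raises
        # IndexError when none of the remaining results has categories.
        i = 2
        while len(genres) < 1:
            book = data['items'][i]
            genres = book['volumeInfo'].get('categories', [])
            i += 1
        return np.array(genres[0])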


book_name = "Hamlet"  # Example book
print(f"The genres of '{book_name}' are {get_book_genre(book_name)}")


The genres of 'Hamlet' are Drama


In [180]:
def genre_finder(row):
    return get_book_genre(row['Booktitle'])

In [181]:
fill = df[df['Genre'].isna()].apply(genre_finder, axis=1)

King+John
The+History+of+Rasselas+Prince+of+Abissinia
The+Shooting+Star
Le+Pre+Goriot
Grim+the+Collier+of+Croydon
This+Present+Darkness
The+Wasps
If+I+Forget+Thee+Jerusalem
Venice+Preservd
Unidentified+Human+Remains+and+the+True+Nature+of+Love
Asterix+and+the+Black+Gold
La+Femme+pige
Once+on+a+Time


IndexError: list index out of range
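
The `IndexError` comes from the fallback loop in `get_book_genre`: it starts at index 2 and keeps incrementing until it runs past the end of `data['items']` when no result has a `categories` entry. A bounded variant could look like the sketch below. The helper name `first_category` is hypothetical. It works on an already decoded response dictionary with the same `items`, `volumeInfo`, and `categories` keys used above, and it returns `None` instead of raising.

```python
def first_category(data):
    # Walk through every result in order and return the first category found
    for item in data.get("items", []):
        categories = item.get("volumeInfo", {}).get("categories", [])
        if categories:
            return categories[0]
    # No result carries a category (or there were no results at all)
    return None

# Hypothetical response fragments for illustration
with_genre = {"items": [{"volumeInfo": {}}, {"volumeInfo": {"categories": ["Drama"]}}]}
without_genre = {"items": [{"volumeInfo": {}}]}

print(first_category(with_genre))
print(first_category(without_genre))
print(first_category({}))
```

The titles printed before the error are the queries that returned no results at all. Several of them, such as `Le+Pre+Goriot` and `La+Femme+pige`, have lost accented letters during cleaning, which may be why the search finds nothing.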

In [176]:
df.isna().sum()

Booktitle       0
Author       2382
Genre        3719
Summary         0
dtype: int64

In [None]:
# We could also use a DistilBERT model to predict the missing genres.
# That requires training the model on the text data, so we will probably run a summarizer model first and then predict the missing genres. We will do this in the NLP notebook.

In [182]:
# We now need only the Genre and Summary columns, so we save them as a .csv file
df[['Genre', 'Summary']].to_csv('booksummaries.csv', index=False)

In [183]:
# Copy the .csv file to Google Drive
!cp booksummaries.csv /content/drive/MyDrive