## In this notebook we preprocess a CSV file of anime character quotes, extract a keyword from each quote, and build a new DataFrame that contains each quote alongside its keyword.

# Importing Pandas

In [None]:
import pandas as pd

## Reading the csv file containing quotes

In [None]:
df = pd.read_csv('/content/lessreal-data.csv',sep=';')

In [None]:
# Keep only the 'Quote' column (double brackets return a DataFrame, not a Series)
df = df[['Quote']]

In [None]:
df.head()

Unnamed: 0,Quote
0,In the end the shape and form don't matter at ...
1,"I'm still a man too, I wanted to look calm and..."
2,"Clausewitz, he pointed out that no matter how ..."
3,Because of the existence of love - sacrifice i...
4,Courage is a word of justice. It means the qua...


## Yet Another Keyword Extractor Installation

We install YAKE with pip and apply it to the anime character quotes to extract the most important keyword from each quote.

In [None]:
!pip3 install yake

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting yake
  Downloading yake-0.4.8-py2.py3-none-any.whl (60 kB)
[?25l[K     |█████▌                          | 10 kB 30.9 MB/s eta 0:00:01[K     |███████████                     | 20 kB 15.2 MB/s eta 0:00:01[K     |████████████████▍               | 30 kB 9.6 MB/s eta 0:00:01[K     |█████████████████████▉          | 40 kB 5.8 MB/s eta 0:00:01[K     |███████████████████████████▎    | 51 kB 7.0 MB/s eta 0:00:01[K     |████████████████████████████████| 60 kB 4.0 MB/s 
Collecting segtok
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting jellyfish
  Downloading jellyfish-0.9.0.tar.gz (132 kB)
Building wheels for collected packages: jellyfish
  Building wheel for jellyfish (setup.py) ... done
  Created wheel for jellyfish: filename=jellyfish-0.9.0-cp37-cp37m-linux_x86_64.whl size=74015 sha256

# Extracting the keyword from each quote

In [None]:
import yake
from operator import itemgetter

def extract_keyword(text):
    # Initialize the keyword extractor with YAKE's default settings
    kw_extractor = yake.KeywordExtractor()
    # extract_keywords returns a list of (keyword, score) tuples
    keywords = kw_extractor.extract_keywords(text)
    if keywords:
        # In YAKE a lower score means a more relevant keyword,
        # so take the candidate with the smallest score
        key = min(keywords, key=itemgetter(1))[0]
    else:
        # Fallback label when YAKE finds no keyword in the text
        key = 'raw'
    return key
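
YAKE assigns lower scores to more relevant keywords, so choosing the single best candidate means taking the smallest score rather than the largest. The sketch below illustrates this on a hand-written candidate list. The scores are made up for illustration (they are not real YAKE output), and `pick_best_keyword` is a hypothetical helper name.

```python
from operator import itemgetter

def pick_best_keyword(candidates, fallback='raw'):
    """Return the keyword with the lowest score, or `fallback` if there are none.

    `candidates` is a list of (keyword, score) tuples, the shape YAKE returns.
    """
    if not candidates:
        return fallback
    return min(candidates, key=itemgetter(1))[0]

# Made-up scores for illustration only; lower means more relevant
sample = [('sacrifice', 0.04), ('existence of love', 0.02), ('born', 0.15)]
best = pick_best_keyword(sample)

# Sorting ascending by score puts the most relevant candidate first
ranked = sorted(sample, key=itemgetter(1))
```

Keeping the selection in a small helper like this makes the fallback value explicit and lets the ranking rule be checked without running the extractor.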

In [None]:
# Add a new column to the dataframe holding the keyword for each quote
df['Keyword'] = df['Quote'].apply(extract_keyword)

In [None]:
# Count the missing values in each column
df.isna().sum()

Quote      404
Keyword      0
dtype: int64

In [None]:
# Drop the rows that have a missing quote
df.dropna(inplace=True)

In [None]:
df.isna().sum()

Quote      0
Keyword    0
dtype: int64
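
Because the `Keyword` column was filled before the missing quotes were dropped, the extractor also ran on the 404 empty rows. A minimal sketch of the alternative order, dropping missing quotes first and extracting afterwards, is shown below. It uses a tiny made-up DataFrame and a stand-in `fake_extract` function (both hypothetical) so that it runs without YAKE installed.

```python
import pandas as pd

# Tiny made-up frame with one missing quote (illustrative data only)
demo = pd.DataFrame({'Quote': ['Courage is a word of justice.', None, 'Love wins.']})

def fake_extract(text):
    # Stand-in for extract_keyword: returns the longest word, lowercased
    words = [w.strip('.,!?').lower() for w in text.split()]
    return max(words, key=len)

# Drop rows with a missing quote first, then extract keywords from the rest
clean = demo.dropna(subset=['Quote']).reset_index(drop=True)
clean['Keyword'] = clean['Quote'].apply(fake_extract)
```

With this order the extractor only ever receives real strings, so no fallback keyword is computed for rows that are discarded anyway.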

## Refined Dataset

In [None]:
df.head()

Unnamed: 0,Quote,Keyword
0,In the end the shape and form don't matter at ...,matter
1,"I'm still a man too, I wanted to look calm and...",love
2,"Clausewitz, he pointed out that no matter how ...",armchair
3,Because of the existence of love - sacrifice i...,comprehends
4,Courage is a word of justice. It means the qua...,excuse


# Exporting the dataset as csv

In [None]:
df.to_csv("Quotes_with_keyword.csv")
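
By default `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, passing `index=False` is one option. The sketch below shows the difference on a made-up one-row frame, writing to in-memory buffers instead of a file.

```python
import io
import pandas as pd

# Made-up one-row frame for illustration
demo = pd.DataFrame({'Quote': ['Love wins.'], 'Keyword': ['love']})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
```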