
## **AUTHOR IDENTIFICATION**

Author identification is the task of identifying the author of a given text. It can be treated as a standard classification problem in which a set of books with known authors is used for training. The aim is to automatically determine the author of an anonymous text.

### NOTE: You are allowed to use ML libraries such as Sklearn, NLTK etc wherever applicable

### Downloading the required nltk Packages before moving ahead

In [None]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## **Stage 1:** Dataset Preparation

### 3 Marks -> Ensure you appropriately split the multiple short stories for the authors listed below; these will be your training data.

**1.** Before moving ahead choose two authors based on your team-number allocation: <br/>



**2.** Link to the short stories collection of each author for your problem: <br />

*   Author-A -> Rudyard Kipling   [Short Stories Collection](http://www.gutenberg.org/files/2781/2781-0.txt) &nbsp;&nbsp;
*   Author-B -> Anton Chekhov [Short Stories Collection](http://www.gutenberg.org/files/1732/1732-0.txt) &nbsp;&nbsp;
*   Author-C -> Guy De Maupassant [Short Stories Collection](http://www.gutenberg.org/cache/epub/21327/pg21327.txt)&nbsp;&nbsp;
*   Author-D -> Mark Twain [Short Stories Collection](http://www.gutenberg.org/files/245/245-0.txt)&nbsp;&nbsp;
*   Author-E -> Saki [Short Stories Collection](http://www.gutenberg.org/files/1477/1477-0.txt)&nbsp;&nbsp;

**Hint for downloading raw text from Gutenberg :**  Refer section "Electronic Books" in the following  [link](https://www.nltk.org/book/ch03.html) for the instructions.  



**Hint for finding the index of a text:**   You may use `raw.find()` and `raw.rfind()` in the same [link](https://www.nltk.org/book/ch03.html) to find appropriate index of the start and end location

**Hint for splitting the multiple stories:** Split the stories on the long runs of white space (several consecutive blank lines) that separate them

**Note:** Ignore the table of contents section from the given stories
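
The download and index hints above can be sketched as follows. This is a minimal sketch, not the assignment's reference solution: the helper names `download_raw` and `strip_gutenberg_boilerplate` are hypothetical, and the marker strings `*** START OF` and `*** END OF` are an assumption about how the Project Gutenberg header and footer are delimited, so check them against the downloaded file before relying on them.

```python
from urllib import request

# Assumed marker prefixes for the Gutenberg header/footer lines (verify per file).
START_MARK = "*** START OF"
END_MARK = "*** END OF"

def download_raw(url):
    """Fetch a plain-text book and decode it as UTF-8 (needs network access)."""
    with request.urlopen(url) as response:
        return response.read().decode("utf-8")

def strip_gutenberg_boilerplate(raw):
    """Keep only the text between the start and end marker lines."""
    start = raw.find(START_MARK)   # first occurrence of the start marker
    end = raw.rfind(END_MARK)      # last occurrence of the end marker
    if start == -1 or end == -1:
        return raw                 # markers not found: leave the text untouched
    start = raw.find("\n", start) + 1  # skip the rest of the marker line
    return raw[start:end].strip()

# Offline demonstration on a tiny stand-in for a downloaded file.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "Story text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "Footer"
)
body = strip_gutenberg_boilerplate(sample)
print(body)

# With network access, the Kipling collection linked above could be loaded as:
# kipling_raw = strip_gutenberg_boilerplate(
#     download_raw("http://www.gutenberg.org/files/2781/2781-0.txt"))
```

The table of contents still has to be skipped separately, for example by calling `find()` on the title of the first story, since its position differs from book to book.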

In [None]:
import nltk
from nltk.corpus import gutenberg
nltk.download('gutenberg')
nltk.download('punkt')
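# Caution: these NLTK corpus files are by Thornton W. Burgess and G. K. Chesterton,
# not Kipling or Chekhov; the assigned collections are at the Gutenberg links above.
kipling_text = gutenberg.raw('burgess-busterbrown.txt')
chekhov_text = gutenberg.raw('chesterton-ball.txt')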


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Step 2: Splitting Texts into Individual Stories**

To split the texts into individual stories, we'll identify delimiters such as title markers or large white spaces which typically separate different stories.

In [None]:
import re
def split_stories(text):
    stories = re.split(r'\n\s*\n', text)  # Split on every blank line (yields paragraph-sized chunks)
    return [story.strip() for story in stories if len(story.strip()) > 100]  # Drop chunks of 100 characters or fewer
kipling_stories = split_stories(kipling_text)
chekhov_stories = split_stories(chekhov_text)
print(f"Number of Kipling stories: {len(kipling_stories)}")
print(f"Number of Chekhov stories: {len(chekhov_stories)}")


Number of Kipling stories: 183
Number of Chekhov stories: 1097
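
The pattern `\n\s*\n` used in `split_stories` matches any single blank line, so the counts printed here are closer to paragraph counts than story counts. A stricter split that only breaks on long gaps might look like the sketch below. The function name `split_on_long_gaps` and the threshold of three blank lines are assumptions for illustration; the right threshold depends on how a given file separates its stories.

```python
import re

def split_on_long_gaps(text, min_blank_lines=3, min_chars=100):
    """Split only where at least `min_blank_lines` blank lines occur in a row."""
    # One newline ends the last text line; each further (optionally
    # whitespace-only) line that follows counts as one blank line.
    pattern = r"\n(?:[ \t\r]*\n){%d,}" % min_blank_lines
    parts = re.split(pattern, text)
    # Drop fragments that are too short to be a story.
    return [p.strip() for p in parts if len(p.strip()) > min_chars]

sample = (
    "TITLE ONE\n\nFirst paragraph.\n\nSecond paragraph."
    "\n\n\n\n\n"
    "TITLE TWO\n\nOnly paragraph."
)
stories = split_on_long_gaps(sample, min_blank_lines=3, min_chars=0)
print(len(stories))
```

In the sample, the single blank lines between paragraphs are kept inside each story, and only the four-blank-line gap produces a split.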


## **Stage 2**: Experiment with Handcrafted features representation
Extract Handcrafted features for the obtained short stories from **Stage-1**

**Stylometry:**

Each person has a unique vocabulary, sometimes rich, sometimes limited. Although a larger vocabulary is usually associated with literary quality, this is not always the case. Ernest Hemingway is famous for using a surprisingly small number of different words in his writing, which did not prevent him from winning the Nobel Prize for Literature in 1954.

Some people write in short sentences, while others prefer long blocks of text consisting of many clauses. No two people use semicolons, em-dashes, and other forms of punctuation in the same way.




**You may explore the following ways to analyze the text and generate handcrafted features by searching text in a probing way:**

a) Could the style of punctuation usage serve as a handcrafted feature, both for authors who follow punctuation rules and for those who don't? Interesting [link](https://qwiklit.com/2014/03/05/top-10-authors-who-ignored-the-basic-rules-of-punctuation/)

b) The same word can sometimes be used in different contexts repeatedly by different authors. Could this fact be converted as a handcrafted feature? [link](https://www.nltk.org/book/ch01.html)

c) The above two are merely examples. As you might have noticed already, the NLTK book [link](https://www.nltk.org/book/) offers several methods of analyzing and understanding text. Each of these analyses can itself serve as a handcrafted feature. **However, for your evaluation, a minimal set of useful handcrafted features that helps you classify an author is sufficient.**

d) Could the most common words be used to distinguish authors? Refer to the "Counting Vocabulary" section of the [link](https://www.nltk.org/book/ch01.html)

e) How about counting the most frequently used bigrams and trigrams, and using those counts to classify an author?

f) Could the frequency histogram of the most frequently used words across the stories by a given author be a useful feature?

The possibilities here are limited only by your imagination, and of course your accuracy! :)


### 2 Marks ->  a) List 6 handcrafted features to distinguish author stories.

In [None]:
# For eg:
# 1. UniqueWords
# 2. AvgSentLength
# List the other handcrafted features here
# 3. CountOfPunctuations (, " ;)
# 4. AvgFrequencyofWords - Average number of occurrences per unique word (total tokens / unique tokens)
# 5. CountOfCommonWords
# 6. CountOfbigramsHavingFreqGreaterThanK
# 7. Most frequent bigrams and trigrams
# 8. Most frequent words
# 9. AvgWordLength

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import string
def unique_words(text):
    words = word_tokenize(text.lower())
    return len(set(words))

def avg_sentence_length(text):
    sentences = sent_tokenize(text)
    total_length = sum(len(word_tokenize(sentence)) for sentence in sentences)
    return total_length / len(sentences)

def punctuation_count(text):
    return sum(1 for char in text if char in string.punctuation)

def avg_word_frequency(text):
    words = word_tokenize(text.lower())
    word_counts = Counter(words)
    return sum(word_counts.values()) / len(word_counts)

def common_word_count(text, common_words):
    words = word_tokenize(text.lower())
    return sum(1 for word in words if word in common_words)

def frequent_ngrams(text, n=2):
    words = word_tokenize(text.lower())
    ngrams = zip(*[words[i:] for i in range(n)])
    ngram_counts = Counter(ngrams)
    return ngram_counts.most_common(5)
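
Raw counts such as `unique_words` and `punctuation_count` grow with the length of the text, so they partly measure chunk size rather than style. One option is to express them as rates. The sketch below is a hedged illustration: the function name `length_normalized_features` is hypothetical, and it uses a simple regular-expression tokenizer instead of `word_tokenize` only to stay dependency-free.

```python
import re
import string

def length_normalized_features(text):
    """Return style features expressed as rates rather than raw counts."""
    # Assumption: a plain regex tokenizer is good enough for this sketch.
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    return {
        "type_token_ratio": len(set(words)) / n_words,       # vocabulary richness
        "punct_per_100_words": 100 * n_punct / n_words,      # punctuation density
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }

feats = length_normalized_features("The cat sat; the dog ran.")
print(feats)
```

Rates like these can be added to the feature dictionary alongside, or in place of, the raw counts.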


**Labelling The Authors**

In [None]:
import pandas as pd
def extract_features(stories, author_label):
    features = []
    for story in stories:
        story_features = {
            'unique_words': unique_words(story),
            'avg_sentence_length': avg_sentence_length(story),
            'punctuation_count': punctuation_count(story),
            'avg_word_frequency': avg_word_frequency(story),
            'author': author_label
        }
        features.append(story_features)
    return features
kipling_features = extract_features(kipling_stories, 'Kipling')
chekhov_features = extract_features(chekhov_stories, 'Chekhov')
df = pd.DataFrame(kipling_features + chekhov_features)


###  4 Marks -> b) Write functions for any 4 of the above 6 handcrafted features and label your authors accordingly.

- Compute any 4 of the 6 handcrafted features listed above for every story obtained from **Stage-1**.
- Identify your target variable as author and label them accordingly.

## **Stage 3:** Experiment with Text processing and representation:
Extract features using TFIDF or CountVectorizer or Word2vec for the obtained short stories from **Stage-1**



### 1 Mark -> a) Performing basic cleanup operations such as removing the newline characters and removing trailing spaces

**For example,** Your sentence looks as follows \[' This is a sentence\n\r. Another sentence \n'].

After newline removal from the above example, your sentence will look like \['This is a sentence. Another sentence'].

In order to do this, you can try using a combination of `split()` and `join()`.
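
Applied to the example above, `split()` and `join()` alone leave a stray space before the period, because the newline characters sit between "sentence" and ".". A small follow-up substitution removes it. This is a sketch under that observation; `clean_sentence` is a hypothetical helper name.

```python
import re

def clean_sentence(text):
    """Collapse all whitespace runs, then drop spaces left before punctuation."""
    collapsed = " ".join(text.split())  # removes \n, \r and trailing spaces
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)

example = " This is a sentence\n\r. Another sentence \n"
cleaned = clean_sentence(example)
print([cleaned])
```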

###  5 Marks-> b) Generate vectors for the given stories

Create a representation of text, convert it into vectors (numbers)


**Use any one** of the following algorithms for this task :

* CountVectorizer or
* TfidfVectorizer or
* Word2Vec (The word2vec bin file (AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD) can be downloaded as a part of setup)
  * perform sentence level tokenization and word level tokenization for the given stories

    **Example of sentences as list of words:**<br/>
    **Before:** ['This is a sentence .' , ' Another sentence']<br/>
    **After:** ['This', 'is', 'a', 'sentence', '.', 'Another', 'sentence']

References Documents:

1.   [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
2.  [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)


Creation of the data frame from the features

Vectorizing the X

###  1 Mark -> c) Is stop word removal necessary in the context of author identification? Your thoughts below?

**Step 3a: Cleanup Operations**

In [None]:
def clean_text(text):
    return ' '.join(text.split())
kipling_stories_cleaned = [clean_text(story) for story in kipling_stories]
chekhov_stories_cleaned = [clean_text(story) for story in chekhov_stories]


**Step 3b: Vectorization using TFIDF**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
all_stories_cleaned = kipling_stories_cleaned + chekhov_stories_cleaned
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(all_stories_cleaned)
y = ['Kipling'] * len(kipling_stories_cleaned) + ['Chekhov'] * len(chekhov_stories_cleaned)


**Step 3c: Stop Word Removal**

Is stop word removal necessary?

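Not necessarily. Removing stop words reduces noise in topic-oriented tasks, but in author identification the frequencies of function words (such as "the", "of", and "but") are a classic stylometric signal, so removing them can discard useful information. The TF-IDF vectorizer in Step 3b keeps them, since `TfidfVectorizer()` removes no stop words by default. Whether removal helps is context-dependent and is best checked empirically on held-out data.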
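
One way to see what stop word removal would discard is to compare the vocabulary of two vectorizers. The sketch below uses two made-up sentences as stand-ins for the cleaned stories; `stop_words="english"` selects scikit-learn's built-in English list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the cleaned stories.
docs = [
    "The man walked to the river and the boat.",
    "It was the best of times, and she knew it.",
]

keep_stops = TfidfVectorizer().fit(docs)                      # default: no removal
drop_stops = TfidfVectorizer(stop_words="english").fit(docs)  # built-in English list

# Words present only when stop words are kept.
removed = sorted(set(keep_stops.vocabulary_) - set(drop_stops.vocabulary_))
print(removed)
```

Training the same classifier on both representations and comparing held-out accuracy would show whether these words help or hurt for a given pair of authors.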

## **Stage 4:** Classification

### Expected accuracy is above 85%

**Training a Classifier**

We'll use a simple classification model, such as a Support Vector Machine (SVM), to classify the authors based on their stories.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
y_pred = svm_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.98
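
In the cell above, the TF-IDF vectorizer is fitted on all stories before the train/test split, so vocabulary and document-frequency statistics from the test stories reach the training step. A variant that splits the raw text first and wraps both steps in a scikit-learn `Pipeline` avoids this. The sketch below uses made-up toy documents as stand-ins for the cleaned stories, and `stratify` keeps the author proportions equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 20 short documents per author label.
docs = [f"the jungle and the wolf ran far, part {i}" for i in range(20)] + \
       [f"the doctor drank his tea in silence, part {i}" for i in range(20)]
labels = ["A"] * 20 + ["B"] * 20

# Split the raw text first so the vectorizer never sees the test documents.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),   # fitted on training text only
    ("svm", SVC(kernel="linear")),
])
model.fit(docs_train, y_train)

y_pred = model.predict(docs_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

With the real data, `docs` and `labels` would be `all_stories_cleaned` and `y` from Step 3b.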


### 4 Marks -> Perform a classification using either features obtained from Stage2 or Stage3

In [None]:
import re
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import string
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


nltk.download('gutenberg')
nltk.download('punkt')


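# Caution: these NLTK corpus files are by Thornton W. Burgess and G. K. Chesterton,
# not Kipling or Chekhov; the assigned collections are at the Gutenberg links in Stage 1.
kipling_text = gutenberg.raw('burgess-busterbrown.txt')
chekhov_text = gutenberg.raw('chesterton-ball.txt')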

def split_stories(text):
    stories = re.split(r'\n\s*\n', text)  # Split on every blank line (yields paragraph-sized chunks)
    return [story.strip() for story in stories if len(story.strip()) > 100]  # Drop chunks of 100 characters or fewer

kipling_stories = split_stories(kipling_text)
chekhov_stories = split_stories(chekhov_text)

def unique_words(text):
    words = word_tokenize(text.lower())
    return len(set(words))

def avg_sentence_length(text):
    sentences = sent_tokenize(text)
    total_length = sum(len(word_tokenize(sentence)) for sentence in sentences)
    return total_length / len(sentences)

def punctuation_count(text):
    return sum(1 for char in text if char in string.punctuation)

def avg_word_frequency(text):
    words = word_tokenize(text.lower())
    word_counts = Counter(words)
    return sum(word_counts.values()) / len(word_counts)

def extract_features(stories, author_label):
    features = []
    for story in stories:
        story_features = {
            'unique_words': unique_words(story),
            'avg_sentence_length': avg_sentence_length(story),
            'punctuation_count': punctuation_count(story),
            'avg_word_frequency': avg_word_frequency(story),
            'author': author_label
        }
        features.append(story_features)
    return features

kipling_features = extract_features(kipling_stories, 'Kipling')
chekhov_features = extract_features(chekhov_stories, 'Chekhov')
df = pd.DataFrame(kipling_features + chekhov_features)
X = df.drop('author', axis=1)
y = df['author']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Accuracy: 0.87


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Splitting the dataset
# X and y are the handcrafted features and author labels from the previous cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Making predictions
y_pred = clf.predict(X_test)

# Evaluating the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.86
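
These accuracies are easier to interpret next to a majority-class baseline. Using the chunk counts printed in Stage 1 (183 and 1097), a model that always predicts the larger class already reaches about 0.857 on the full set, which is close to the 0.87 and 0.86 obtained from the handcrafted features. The arithmetic is shown below; the baseline on the actual test split may differ slightly because the split is random.

```python
# Chunk counts printed by the Stage 1 cell.
n_kipling = 183
n_chekhov = 1097

# Accuracy of always predicting the larger class.
baseline = max(n_kipling, n_chekhov) / (n_kipling + n_chekhov)
print(f"Majority-class baseline: {baseline:.3f}")
```

Reporting per-class precision and recall alongside accuracy would show whether the smaller class is being recognized at all.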


# Further Ideas for exploration after the hackathon:

**Statistical analysis** of text using NLP: analyzing the meaning of sentences, feature-based grammars, and the structure of sentences.

reference: www.nltk.org/book