<a id='top'></a>

# NLP 2 - Compare Reviewers Writing Styles via Cosine Angles of Vectors

#### This project uses Natural Language Tool-Kit (NLTK) for text preprocessing, then assesses the similarity of document pairs using the cosine of the angle between two vectors.

Representing documents in vector space based on the counts of words in the documents (see NLP 1) may mistakenly classify documents with similar content as different, due to difference in document size. Computing the cosine similarity of the vector representations of two documents compensates for the effect of document length when quantifying their similarity. This natural language processing project compares the cosine of the angles between sets of three documents (in three pairs) to see if writing styles are similar or different. The goal was to see if expected differences and similarities in writing styles were borne out by computing and comparing the cosine similarity of the vector representations of pairs of documents. 

The documents being compared are movie reviews downloaded from rottentomatoes.com. After some data wrangling, each document is the entirety of a particular reviewers' text reviews. 

For a full description of what documents are being compared and why, see the blog post for this project.

- [Step 1](#Step_1) Data wrangling. Select text of reviewers that are being compared and combine all their reviews into a dataframe with different rows for each reviewer, and a column that is the entire text of their reviews joined into a single string. For each reviewer, create a variable representing the string of all their reviews joined together.  
- [Step 2](#Step_2) Process the text for analysis. Create a function to process text for analysis, including tokenizing (break words into components), stemming with nltk PorterStemmer (consolidate different permutations of words), and removing stopwords (specific articles, prepositions and other words that do not add meaning).
- [Step 3](#Step_3) Compute cosine similarities. Create a function to calculate and compare the cosine of the angles for the three documents in each set. 
- [Step 4](#Step_4) Group reviewers for comparison, and create a for loop to calculate and print cosine similarities of reviewers.
- [Step 5](#Step_5) References. 

In [1]:
import pandas as pd
pd.set_option('mode.chained_assignment', None)
pd.set_option('max_columns', 120)
pd.set_option('max_colwidth', 5000)
pd.options.display.max_rows = 999
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings("ignore")

Back to [top](#top) 

<a id='Step_1'></a>

#### Step 1: Data wrangling.

Select text of reviewers that are being compared and combine all their reviews into a dataframe with different rows for each reviewer, and a column that is the entire text of their reviews joined into a single string. For each reviewer, create a variable representing the string of all their reviews joined together.  

In [2]:
file_loc = "C:/Users/rmbrm/Documents/DS_WP/data/critics.csv"
df = pd.read_csv(file_loc, index_col=None, na_values=['NA'], usecols = ['critic', 'quote'] )
df = df[~df.quote.isnull()]
df.head()

Unnamed: 0,critic,quote
1,Derek Adams,"So ingenious in concept, design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm."
2,Richard Corliss,The year's most inventive comedy.
3,David Ansen,A winning animated feature that has something for everyone on the age spectrum.
4,Leonard Klady,The film sports a provocative and appealing story that's every bit the equal of this technical achievement.
5,Jonathan Rosenbaum,"An entertaining computer-generated, hyperrealist animation feature (1995) that's also in effect a toy catalog."


In [3]:
reviewers = ["Anthony Lane", "Dave Kehr", "Rex Reed", "Jonathan Rosenbaum", "Richard Roeper", "Kenneth Turan",
             "Michael Wilmington", "David Edelstein", "Owen Gleiberman", "Lisa Schwarzbaum"]

df = df.loc[df['critic'].isin(reviewers)]

df2 = df.groupby('critic').agg({
    'quote': lambda x: ' '.join(x)
})

df2.head(2)

Unnamed: 0_level_0,quote
critic,Unnamed: 1_level_1
Anthony Lane,"Brosnan, however, looks set to stay. He'll never recapture the amused cool of the young Sean Connery, but he does overcome the handicap of looking like a humorless male model. In a season of fat blockbusters, a picture as brainy, bitter, and compact as this one comes as a shock and a treat. The two leads' sly comic rhythm is miles removed from the book's failing solemnity. The political argument that ensues is pretty dull, but the battle scenes are the loudest and most convincing in years: Gibson has learned from Kurosawa in lending a clarifying thrust to what is, essentially, chaos. Berkley's acting debut is a joy, if you can call it acting: she jumps up and down a lot to indicate excitement. The talk is dirty and funny, the violence always waiting just around the corner. Warm, wise, and wearisome as hell. The result is clean, delirious, and, yes, speedy -- the best big-vehicle-in-peril movie since Clouzot's The Wages of Fear. That rasping tension is soon smoothed away, as the plot sets off on its daft and hackneyed course. It's a pleasure to find a thriller fulfilling its duties with such gusto: the emotions ring solid, the script finds time to relax into backchat, and for once the stunts look like acts of desperation rather than shows of prowess. This is really quite an achievement. It brings together Jeremy Irons, Meryl Streep, Winona Ryder, Antonio Banderas, and Vanessa Redgrave and insures that, without exception, they all give their worst performances ever. The grand finale? A fistfight, after which somebody gets run over. Listen, if I want to see that kind of action, I don't go to Shanghai. I don't even go to the movies. I go to the South Bronx and stand outside a bar. Its winning formula is driven not by narrative but by what isn't there: no sex, no drugs, and no rock and roll-- unless you count the logjam of thundering power ballads. The funniest thing about The Women is that Mick Jagger is one of the producers. There was a knowing laugh in the theatre as his name sprang up in the opening credits -- our last chance to laugh, as it turned out, for the next two hours. The Pythons are enlightened jesters, whose scorn is reserved for those who persist in walking in darkness. We have yet to recover from his revelation: that there is nothing more real than sitting in your own back yard -- waiting for the unreal to come down, take a handful of candy, and fly you to the moon. We see through a glass darkly, and often confusingly, but at least we see. The triumph of the film lies not just in the force and range of the performances, but in Minghella's creation of an intimate epic: vast landscapes mingle with the minute details of desire, and the combination is transfixing. Action thrillers assail but rarely test us; this is the tautest, most provoking, and altogether most draining example ever made. Terminator Salvation is a confused, humorless grind, with nobody, from the stars to the set designers, prepared to prick its self-importance. Even if you don't buy the main conceit, the scumbled texture of the movie makes it feel not just plausible but recognizable, and Cuaron takes care never to paint the future as consolingly different. The plot is impressively free of anything that does not smell of unpasteurized melodrama. Carrey is on his mettle, but you wonder why thirty years of close observation have made Truman so funny; shouldn't he be a regular guy gone mad? There are no surprises in this movie, and most people will be able to predict, within the first ten minutes, roughly how the last ten will pan out. A dark and fidgety picture from Christopher Nolan, who made such a splash with Memento. After many mishaps, the art of bringing Elmore Leonard's novels to the screen is coming to fruition. This latest adaptation, by director Steven Soderbergh and screenwriter Scott Frank, gets it just about right. Spielberg obviously decided that blood and guts meant just that, and so he arranged his violence into a semblance of pure disorder. The illusion holds, complete with severed limbs and wellsprings of blood, and it feels honorable. As I watched this film, an eager victim of its boundless will to astound, I found my loyal memories of the book beginning to fade. The problem is not that we know the outcome. The problem is the buildup. You cannot help being stirred by the reach and depth, the constant rebuffs to sloppiness, of a strong ensemble."
Dave Kehr,"Wilder's tastelessness now seems his major artistic strength. Kurosawa's film is a model of long-form construction, ably fitting its asides and anecdotes into a powerful suspense structure that endures for all of the film's 208 minutes. As recorded in the great wealth of documentary footage Ofteringer has assembled, the cheekbones slowly collapse and the blue eyes become watery, their owner becoming more and more dependent on hard drugs and fast living. What begins as an anti-Goodbye, Mr. Chips ends, thanks to some psychological point stretching, as an imitation of it. It's a highly professional piece of Hollywood sentimentalism. A serious disappointment, recommended only for inveterate Disney fans and very young people. Like the roller-coaster ride Cliffhanger clearly wants to be, the film sends you out pleasantly rattled and wobbly of gait. It's a beautifully proportioned, wonderfully complete movie. The characters aren't much more well-defined than the anonymous victims of a teen horror movie... The dinosaur effects, however, are absolutely stunning, and sometimes so natural that one even forgets to be impressed. What it offers, apart from the overblown special effects that seem inescapable in American movies, is an unusual and effective combination of swooning, morbid romance and screwball comedy. The pathos of the film is the pathos of its leading character -- it is a magnificent machine, but a machine it remains. It's a gnarled, brutal, highly manipulative film that, at its center, seems morally indefensible. The on-screen carnage established a new level in American movies, but few of the films that followed in its wake could duplicate Peckinpah's depth of feeling. Some of the animation is first-rate, particularly in the more modest comedy segments, and even the heavy set pieces have greater flash and dazzle than anything Ralph Bakshi mustered around the same period. This 1970 animated feature is dull, careless, and all too typical of the Disney studio's slapdash output before the unexpected renaissance of The Rescuers. All in all, a superior genre piece, if not the height of Hill's artistry. The film is ugly on so many levels -- from art direction to human values -- that it's hard to know where to begin. It remains an outstanding example of the filmmaker's power to transform an environment through the selection of detail: everything in it is familiar, but nothing is recognizable. The material is consistently clever and funny, though ultimately the attitudes are too narrow to nourish a feature-length film. No masterpiece, but a prime example of subversive cinema. Rohmer's impossibly light, graceful way of posing profound moral questions hasn't yet wholly coalesced, though this 1966 film does have his soft, slow rhythm. The ultimate family film. George Cukor gives it the royal treatment with a splendid supporting cast... One of the shining glories of the American musical. This story of a party girl in love with a gigolo allows [director Blake] Edwards to create a very handsome film, with impeccable Technicolor photography by Franz Planer. One of the landmarks--not merely of the movies, but of 20th-century art. This is Capra at his best, very funny and very light, with a minimum of populist posturing. A great film, and certainly one of the most entertaining movies ever made, directed by Alfred Hitchcock at his peak. In many ways, the ultimate Billy Wilder film -- replete with breathless pacing, transvestite humor, and unflinching cynicism. A terrifically entertaining comedy-thriller. Who can argue with Bogart's glower or Mary Astor in her ratty fur? Lerner and Loewe's musical masterwork, reimagined for film by director George Cukor. Wyler lays out all the elements with care and precision, but the romantic comedy never comes together -- it's charm by computer. One of the first films to integrate musical numbers into the plot, it explores, without condescension or simplemindedness, the feelings that drive the family members apart and then bring them back together again. I don't find the film light or joyful in the least -- an air of primal menace hangs about it, which may be why I love it. A critic-proof movie if there ever was one: it isn't all that good, but somehow it's great. It is still the best place I know of to start thinking about Welles -- or for that matter about movies in general. It was a freshening attitude then, though its long-term effects haven't been all to the good. The hoped-for tone of Restoration comedy never quite materializes, perhaps because Mankiewicz's cynicism is only skin-deep, but the film's tinny brilliance still pleases. Through its first two-thirds it is as perfect a myth of adolescence as any of the Disney films, documenting the childlike, nameless heroine's initiation into the adult mysteries of sex, death, and identity. This film contains one of Hitchcock's most famous set pieces -- an assassination in the rain -- but otherwise remains a second-rate effort, as immen..."


In [4]:
Lane= df2.loc["Anthony Lane", "quote"]
Kehr= df2.loc["Dave Kehr", "quote"]
Reed= df2.loc["Rex Reed", "quote"]
Rosenbaum= df2.loc["Jonathan Rosenbaum", "quote"]
Roeper= df2.loc["Richard Roeper", "quote"]
Turan= df2.loc["Kenneth Turan", "quote"]
Wilmington= df2.loc["Michael Wilmington", "quote"]
Edelstein= df2.loc["David Edelstein", "quote"]
Gleiberman= df2.loc["Owen Gleiberman", "quote"]
Schwarzbaum= df2.loc["Lisa Schwarzbaum", "quote"]



Back to [top](#top) 

<a id='Step_2'></a>

#### Step 2: Process the text for analysis.

Create a function to process text for analysis, including tokenizing (break words into components), stemming with nltk PorterStemmer (consolidate different permutations of words), and removing stopwords (specific articles, prepositions and other words that do not add meaning).


In [5]:
def process(file):
    raw = open(file).read()
    tokens = word_tokenize(raw)
    words = [w.lower() for w in tokens]
    
    porter = nltk.PorterStemmer()
    stemmed_tokens = [porter.stem(t) for t in words]
    
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in stemmed_tokens if not w in stop_words]
    
    count = nltk.defaultdict(int)
    for word in filtered_tokens:
        count[word] += 1
    return count;

Back to [top](#top) 

<a id='Step_3'></a>

#### Step 3: Compute cosine similarities.

Create a function to calculate and compare the cosine of the angles for the three documents in each set.

In [6]:
def cos_sim(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

def getSimilarity(dict1, dict2):
    all_words_list = []
    for key in dict1:
        all_words_list.append(key)
    for key in dict2:
        all_words_list.append(key)
    all_words_list_size = len(all_words_list)
    
    v1 = np.zeros(all_words_list_size, dtype=np.int)
    v2 = np.zeros(all_words_list_size, dtype=np.int)
    i = 0
    for (key) in all_words_list:
        v1[i] = dict1.get(key, 0)
        v2[i] = dict2.get(key, 0)
        i = i + 1
    return cos_sim(v1, v2);

Back to [top](#top) 

<a id='Step_4'></a>

#### Step 4: Group reviewers for comparison, and create a for loop to calculate and print cosine similarities of reviewers.

In [7]:
sets = [["Lane", "Kehr", "Reed"], ["Lane", "Roeper", "Reed"], ["Kehr", "Rosenbaum", "Roeper"], ["Kehr", "Rosenbaum", "Turan"], ["Kehr", "Rosenbaum", "Wilmington"],
        ["Edelstein", "Gleiberman", "Wilmington"], ["Edelstein", "Gleiberman", "Roeper"], 
        ["Edelstein", "Gleiberman", "Lane"],["Edelstein", "Gleiberman", "Schwarzbaum"], 
       ["Edelstein", "Gleiberman","Reed"], ["Kehr", "Rosenbaum", "Edelstein"], 
        ["Lane", "Reed", "Wilmington"]]

In [8]:
numbers = range(0,12)
for num in numbers:
    rev1 = (sets[num])[0]
    rev2 = (sets[num])[1]
    rev3 = (sets[num])[2]
    dict1 = process(rev1)
    dict2 = process(rev2)
    dict3 = process(rev3)
    sim1 = (getSimilarity(dict1,dict2)).round(2)
    sim2 = (getSimilarity(dict1,dict3)).round(2)
    sim3 = (getSimilarity(dict2,dict3)).round(2)
    print(f"Similarity between {rev1} and {rev2} is {sim1}")
    print(f"Similarity between {rev1} and {rev3} is {sim2}")
    print(f"Similarity between {rev2} and {rev3} is {sim3}")
    print('\n')
    print('-----------------------')
    print('\n')
    

Similarity between Lane and Kehr is 0.92
Similarity between Lane and Reed is 0.89
Similarity between Kehr and Reed is 0.91


-----------------------


Similarity between Lane and Roeper is 0.79
Similarity between Lane and Reed is 0.89
Similarity between Roeper and Reed is 0.85


-----------------------


Similarity between Kehr and Rosenbaum is 0.97
Similarity between Kehr and Roeper is 0.88
Similarity between Rosenbaum and Roeper is 0.9


-----------------------


Similarity between Kehr and Rosenbaum is 0.97
Similarity between Kehr and Turan is 0.97
Similarity between Rosenbaum and Turan is 0.96


-----------------------


Similarity between Kehr and Rosenbaum is 0.97
Similarity between Kehr and Wilmington is 0.95
Similarity between Rosenbaum and Wilmington is 0.95


-----------------------


Similarity between Edelstein and Gleiberman is 0.94
Similarity between Edelstein and Wilmington is 0.93
Similarity between Gleiberman and Wilmington is 0.98


-----------------------


Similarit

Back to [top](#top) 

<a id='Step_5'></a>

#### Step 5: References

The primary references for this project are:

 - Schütze, Hinrich; Raghavan, Prabhakar; Manning, Christopher D. Introduction to Information Retrieval.
 - https://blogs.oracle.com/meena/finding-similarity-between-text-documents