# Problem set 3: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle dataset that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases, if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [1]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string

## nltk imports
import nltk
### run these lines if you haven't downloaded the relevant nltk add-ons yet (comment them out afterwards)
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
### ! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)



In [3]:
## 0.1 Load and clean text data
## first, unzip the file pset3_inputdata.zip 
## then, run this code to load the unzipped json file and convert to a dataframe
## (may need to change the pathname depending on where you store stuff)
## and convert some of the attributes from lists to values
doj = pd.read_json("./combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head()


In [5]:
## your code to subset to one press release and take the string
## (.iloc[0] returns the raw text; .to_string() would prepend the row index)
pharma = doj.loc[doj['id'] == '17-1204', 'contents'].iloc[0]
pharma


## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

### 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation / digits (you can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort them from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the adjective tag names within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp
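
As a rough sketch of parts A and C, the snippet below uses only the standard library. The regex approximates the pattern that `wordpunct_tokenize` is built on, the sample sentence is made up, and the `tagged` list is hand-written to stand in for what `pos_tag` might return; none of it is real output from the press release.

```python
import re
from collections import Counter

# Made-up sentence, not taken from the press release.
sample = "The former CEO paid 3 illegal bribes; the illegal scheme ran for 2 long years."

# Rough stand-in for wordpunct_tokenize: runs of word characters or runs of punctuation.
tokens = re.findall(r"\w+|[^\w\s]+", sample)

# Part A: keep only purely alphabetic tokens (drops "3", ";", "2", and ".").
words_only = [tok for tok in tokens if tok.isalpha()]

# Hand-written (word, tag) pairs standing in for pos_tag(words_only).
tagged = [("The", "DT"), ("former", "JJ"), ("CEO", "NNP"), ("paid", "VBD"),
          ("illegal", "JJ"), ("bribes", "NNS"), ("the", "DT"), ("illegal", "JJ"),
          ("scheme", "NN"), ("ran", "VBD"), ("for", "IN"), ("long", "JJ"),
          ("years", "NNS")]

# Part C: the Penn Treebank adjective tags are JJ, JJR, and JJS.
adjective_tags = {"JJ", "JJR", "JJS"}
adj_counts = Counter(word.lower() for word, tag in tagged if tag in adjective_tags)

# most_common(5) returns (adjective, count) pairs sorted by count, highest first.
top_adjectives = adj_counts.most_common(5)
print(top_adjectives)
```

The `(adjective, count)` pairs can be passed to `pd.DataFrame` with two column names to get the dataframe that part C asks for.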

In [9]:
## your code here to restrict to alpha
pharma_tokenized = wordpunct_tokenize(pharma)
pharma_wordsonly = [token for token in pharma_tokenized if token.isalpha()]
pharma_wordsonly


In [11]:
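## your code here for part of speech tagging
pharma_tagged = pos_tag(pharma_wordsonly)
pharma_tagged
## keep adjectives: JJ (base), JJR (comparative), JJS (superlative)
pharma_adjonly = [word for word, tag in pharma_tagged if tag in ("JJ", "JJR", "JJS")]
## count each adjective (value_counts sorts from most to fewest) and keep the top 5
df_pharma_adjonly = (pd.Series(pharma_adjonly)
                     .value_counts()
                     .head(5)
                     .rename_axis("adjective")
                     .reset_index(name="count"))
df_pharma_adjonly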


### 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

In [40]:
## your code here for part A
spacy_pharma = nlp(pharma)
spacy_pharma.ents

#for one_tok in spacy_pharma.ents:
#   print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)


In [42]:
## your code here for part B
pharma_law = [token.text for token in spacy_pharma.ents if token.label_ == "LAW"]
set(pharma_law)


C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

In [24]:

# RICO stands for the Racketeer Influenced and Corrupt Organizations Act.
# Under RICO, it is illegal to participate in a pattern of racketeering activity or unlawful debt collection through an enterprise.
# RICO applies to this case because the charges involve racketeering, mail fraud, and wire fraud: Kapoor allegedly used fraud and bribes to market the drug.

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.
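
One way to read the second condition in part D: in the pattern `years?` the `?` makes the trailing `s` optional, so it matches both `year` and `years`. The sketch below applies that filter to hand-written `(text, label)` pairs that stand in for `(ent.text, ent.label_)` from a spaCy document; the pairs are illustrative assumptions, not real model output.

```python
import re

# Hypothetical (text, label) pairs standing in for spaCy entities.
entities = [
    ("20 years", "DATE"),
    ("one year", "DATE"),
    ("October 2017", "DATE"),
    ("RICO", "LAW"),
    ("five years", "DATE"),
]

# \b anchors restrict the match to the whole word "year" or "years".
year_pattern = re.compile(r"\byears?\b", flags=re.IGNORECASE)

# Keep entities that are labeled DATE and mention year/years.
year_dates = [text for text, label in entities
              if label == "DATE" and year_pattern.search(text)]
print(year_dates)
```

Note that placing the `?` one character earlier (`year?s`) would make the `r` optional instead, which matches `years` but not `year`.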

In [25]:
## your code here
## "years?" matches both "year" and "years" (the ? makes the s optional)
regex_year = r"\byears?\b"
potential_sentence = [ent.text for ent in spacy_pharma.ents
                      if ent.label_ == "DATE" and re.search(regex_year, ent.text)]
potential_sentence


['20 years', 'three years', 'five years', 'three years']

E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). 

**Hint**: you may want to use `re.search` or `re.findall`, or anything else that works 😳.
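
A minimal sketch of one approach to part E, using a made-up passage rather than the real press release: split the text into rough sentences with `re.split`, then keep the ones where `re.search` finds `year` or `years`. The splitting rule here is a naive assumption and would break on abbreviations such as `U.S.`, so treat the result as a rough region rather than an exact sentence.

```python
import re

# Made-up text; the real press release is much longer.
text = ("The defendant was arrested on Thursday. "
        "The charge carries a sentence of up to 10 years in prison and "
        "one year of supervised release. "
        "An indictment is merely an accusation.")

# Naive sentence split: break on whitespace that follows ., !, or ?.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Keep only the sentences that mention "year" or "years".
year_sentences = [s for s in sentences if re.search(r"\byears?\b", s)]
for s in year_sentences:
    print(s)
```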

In [26]:
## your code here
## split the raw text on non-breaking spaces (\xa0) into rough passages
pharma_split = pharma.split("\xa0")
regex_year_sentence = r"\byears?\b"
## named year_passages rather than list, which would shadow the built-in
year_passages = [passage for passage in pharma_split if re.search(regex_year_sentence, passage)]
year_passages

#At maximum, the CEO faces up to 20 years in prison followed by three years of supervised release.

[' This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The U.S. Department of Veterans Affairs, Office of Inspector General will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said Donna L. Neves, Special Agent in Charge of the VA OIG Northeast Field Office.The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss.',
 ' The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine. Sentences are imposed by a federal district court judge based upon the U.S. Sentencing Guidelines and other statutory factors

### 1.3 sentiment analysis (10 points)

A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [27]:
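## your code here for subsetting
topics_keep = ["Civil Rights", "Hate Crimes", "Project Safe Childhood"]
## .copy() makes doj_subset an independent dataframe so columns can be added to it later
doj_subset = doj[doj["topics_clean"].isin(topics_keep)].copy()
doj_subset.shape
doj_subset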

(717, 6)

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
77,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. 
Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle"
155,15-1522,Alabama Man Found Guilty of Aggravated Sexual Abuse of a Child,"A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced. According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army. Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. U.S. Army Criminal Investigations Division and the FBI’s Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2015-12-11T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern"
157,16-213,Alabama Man Indicted on Child Pornography and Sex Tourism Charges,"An Alabama native was indicted today and charged with multiple crimes involving travel with intent to engage in illicit sexual conduct with minors and child pornography, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Kenyen R. Brown of the Southern District of Alabama. Clarence Edward Evers Jr., aka Bud, a technology teacher employed by the Conecuh County, Alabama, Board of Education, was arrested on Feb. 11, 2016, and was charged today with five counts of travel with intent to engage in illicit sexual conduct with a minor, one count of attempted travel with intent to engage in illicit sexual conduct with a minor, one count of production and attempted production of child pornography, one count of transportation of child pornography, one count of receipt of child pornography, one count of access with intent to view child pornography and one count of possession of child pornography. According to the indictment, Evers allegedly traveled to Thailand in the summers of 2010 through 2014 for the purpose of engaging in illicit sexual conduct with a minor and allegedly attempted to make a similar trip in the spring of 2015. During the 2014 trip, Evers also allegedly photographed his victims’ abuse and then transported the images back to the United States. In addition, Evers allegedly had other images of child sexual exploitation on his computers and other electronic devices. The charges contained in the indictment are only allegations. Evers is presumed innocent unless and until he is proven guilty beyond a reasonable doubt in a court of law. ICE-HSI is investigating this case. Trial Attorney James E. Burke IV of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Sean P. Costello and Maria E. Murphy of the Southern District of Alabama are prosecuting the case. 
This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-02-24T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Southern"
162,16-381,Alabama Man Indicted for Producing Child Pornography Involving Multiple Victims,"An Alabama man was indicted today by a federal grand jury in Birmingham, Alabama, on charges related to the production of child pornography involving four minor victims, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Joyce White Vance of the Northern District of Alabama. Gregory Jerome Lee, 53, formerly of Cullman County, Alabama, was indicted on four counts of production of child pornography, one count of conspiracy to advertise child pornography and one count of conspiracy to distribute and receive child pornography. According to the indictment, from September 1996 through December 2004, Lee used, persuaded, coerced and enticed minors to engage in sexually explicit conduct in order to produce images of that conduct. Between September 1996 and August 2007, Lee conspired with other individuals to distribute and receive child pornography through a variety of means, including the Internet. The U.S. Postal Inspection Service (USPIS) is investigating the case. Trial Attorney Amy E. Larson of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. The charges and allegations contained in an indictment are merely accusations. The defendant is presumed innocent unless and until proven guilty. Members of the public who may have information related to this matter should call the USPIS Birmingham Office at (205) 326-2909. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. 
Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-30T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern"
168,14-464,Alabama Man Indicted for Threatening African-American Man and Another Person at Restaurant,"Jeremy Heath Higgins was indicted for threatening an African-American man at a Quinton, Alabama, restaurant, and for threatening another person who ordered Higgins to leave the restaurant due to his behavior, Acting Assistant Attorney General Jocelyn Samuels for the Justice Department’s Civil Rights Division and U.S. Attorney Joyce Vance for the Northern District of Alabama announced today. Higgins, 28, was charged in a three count indictment returned yesterday by a federal grand jury in the U.S. District Court for the Northern District of Alabama. The indictment charges him with one felony count and two misdemeanor counts of interference with a federally-protected activity. The indictment alleges that on June 14, 2013, Higgins approached and threatened an African-American man at the Alabama Rose Steakhouse because the man was present at the restaurant with a white woman. According to the indictment, another person ordered Higgins to leave the premises of the restaurant because of Higgins’ behavior toward the African-American man, after which Higgins allegedly shouted a threat to burn down the restaurant. The indictment further alleges that Higgins threatened the person who had ordered him to leave the restaurant by painting graffiti on the restaurant’s exterior and fence. If convicted of the felony count of the indictment, Higgins could face a maximum sentence of 10 years in prison and a $250,000 fine. For each of the misdemeanor charges, Higgins could face a maximum sentence of one year in prison and a $200,000 fine. This case is being investigated by the FBI and is being prosecuted by Assistant U.S. Attorney Robin B. Mark of the Northern District of Alabama and Trial Attorney David Reese of the Justice Department’s Civil Rights Division. 
An indictment is merely an accusation, and the defendant is presumed innocent unless proven guilty.",2014-05-01T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section
...,...,...,...,...,...,...
13002,09-368,West Virginia Man Pleads Guilty on Federal Civil Rights Charges,"WASHINGTON - Daryl Lee Fierce, 69, of Charleston, W.Va., pleaded guilty today to a civil rights charge in federal court in the Southern District of West Virginia for using fire to intimidate and interfere with a person’s housing rights. Fierce set fire to the victim’s home because African-American and biracial individuals visited the victim in her home. Pursuant to the plea agreement, Fierce faces up to 10 years in prison and a fine of up to $250,000. Sentencing is scheduled for July 30, 2009. According to documents filed in court, on or about July 16, 2007, Fierce admitted that he set fire to a home located on Noyes Avenue in Charleston because the tenant occupying the home, a white woman, associated with persons of another race and color. Fierce set fire to the outside wall of the victim’s bedroom at night as she slept. Fierce further admitted that before the incident he had used racial epithets against guests, including young children, who visited the victim’s home. ""Living in one’s home and associating with friends of one’s choosing, without violent interference because of race, is a core right of all persons in this country,"" said Loretta King, Acting Assistant Attorney General for Civil Rights. ""The defendant used violence against an innocent victim because of his racial prejudice. This is illegal, and despicable, and we will prosecute such crimes whenever and wherever they occur."" The FBI, the Charleston Police Department and the Charleston Fire Department investigated this case. The case was prosecuted by James Walsh with the Justice Department’s Civil Rights Division and Lisa G. Johnson, Assistant U.S. Attorney for the Southern District of West Virginia.",2009-04-20T00:00:00-04:00,Hate Crimes,Civil Rights Division
13032,18-775,Wisconsin Man Indicted for Producing Child Pornography Outside of the United States,"A Wisconsin man was charged in an indictment yesterday with the crimes of producing and possessing child pornography and engaging in illicit sexual conduct in a foreign place, announced Acting Assistant Attorney General John P. Cronan of the Justice Department’s Criminal Division and U.S. Attorney Matthew D. Krueger of the Eastern District of Wisconsin. Jeffrey H. Ernisse, 61, is currently incarcerated for state offenses related to child exploitation at the Red Granite Correctional Institution in Wisconsin. A grand jury in the U.S. District Court for the Eastern District of Wisconsin indicted Ernisse on two counts of producing child pornography, two counts of producing child pornography outside of the United States, one count of engaging in illicit sexual conduct with a minor in the Philippines and one count of possessing child pornography. According to the indictment, on or about March 10, 2015 and then again, on or about April 7, 2015, Ernisse used a minor to engage in sexually explicit conduct for the purpose of producing child pornography. Between approximately June 17, 2014, and approximately April 11, 2015, Ernisse engaged in illicit sexual conduct with a minor in the Republic of the Philippines. And on or about Dec. 18, 2015, Ernisse possessed child pornography. The charges contained in the indictment are merely allegations. The defendant is presumed innocent until proven guilty beyond a reasonable doubt in a court of law. U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) is investigating this case with the cooperation of the Sheboygan, Wisconsin, Police Department. Trial Attorney William M. Grady of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Megan J. Paulson and Penelope L. Coblentz of the Eastern District of Wisconsin are prosecuting the case. 
This investigation is a part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state, and local resources to better locate, apprehend, and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2018-06-13T00:00:00-04:00,Project Safe Childhood,Criminal Division; Criminal - Child Exploitation and Obscenity Section
13034,12-596,Wisconsin Man Pleads Guilty to Sexual Exploitation of a Minor in Belize,"WASHINGTON – A Wisconsin man pleaded guilty today in federal court in Milwaukee to traveling in foreign commerce and engaging in and attempting to engage in illicit sexual conduct with a minor, announced Assistant Attorney General Lanny A. Breuer of the Justice Department’s Criminal Division; U.S. Attorney James L. Santelle of the Eastern District of Wisconsin; John Morton, Director of U.S. Immigration and Customs Enforcement (ICE); and Scott Bultrowicz, Director of the U.S. State Department’s Diplomatic Security Service (DSS). Roland J. Flath, 72, pleaded guilty before U.S. District Judge J.P. Stadtmueller. According to court documents, Flath, of Fond du Lac, Wis., traveled to Belize in July 2006, and subsequently sexually molested a minor girl from Belize. Flath was originally charged by a criminal complaint filed in the Eastern District of Wisconsin in October 2010. He was arrested by the Guatemalan National Civil Police on Feb. 20, 2011, expelled to the United States and arrested in the United States by ICE agents and the U.S. Marshal Service. Flath was indicted on March 22, 2011, by a grand jury sitting in the Eastern District of Wisconsin. Flath faces a maximum penalty of up to 30 years in prison and a fine of $250,000. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and the Criminal Division’s Child Exploitation and Obscenity Section (CEOS), Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.projectsafechildhood.gov. This case is being prosecuted by Assistant U.S. 
Attorney Penelope Coblentz of the Eastern District of Wisconsin and Trial Attorney Mi Yung Park of CEOS. Assistance was provided by the Office of International Affairs in the Justice Department’s Criminal Division. This case is a result of investigative efforts led by ICE Homeland Security Investigations (HSI) in Milwaukee and the DSS’s Regional Security Office in Belize, CEOS’s High Technology Investigative Unit, and the Belize Police Department.",2012-05-09T00:00:00-04:00,Project Safe Childhood,Criminal Division
13068,18-359,Wyoming Military Department Found Liable for Subjecting Employee to Sexual Harassment,"The Justice Department today announced that on March 21, 2018, a federal district court in Casper, Wyoming, found that the Wyoming Military Department (WMD) discriminated against former employee Amanda Dykes by subjecting her to sexual harassment and constructively discharging her. The verdict was returned after a July 2017 bench trial during which the Justice Department produced evidence that the defendant violated Title VII of the Civil Rights Act of 1964, which prohibits discrimination on the basis of race, color, national origin, sex, and religion. The evidence produced at trial showed that Dykes was subjected to sexual harassment by her direct supervisor, former employee Don Smith, when both worked at WMD’s Wyoming Youth Challenge Program. Smith subjected Dykes to persistent, unwelcomed conduct including poems, songs, and emails professing his affection and love for her as well as constant visits to her office. These intensified to such a degree that Dykes asked her subordinates to help her avoid being left alone with her supervisor. Dykes reported the supervisor’s conduct to her employer’s human resources department as well as to his direct supervisor, but received no assistance in remedying the harassment. The court found that harassing behavior persisted for over 18 months despite Dykes’ numerous complaints, that no reasonable employee could be expected to remain in her job under these circumstances, and that Dykes had no choice but to resign her position in September 2011 to avoid the continued harassment. The district court ordered WMD to pay $221,030.62 to Dykes for the salary and benefits she lost as a result of her constructive discharge. 
This judgment represents the first successful sexual harassment trial verdict obtained in a Title VII case since the launch of the Civil Rights Division’s Sexual Harassment in the Workplace Initiative (SHWI), which focuses on workplace sexual harassment in the public sector. As part of the Initiative, the Justice Department will continue to bring sex discrimination claims against state and local government employers with a renewed emphasis on sexual harassment charges. The Department will also work to develop effective remedial measures that can be used to hold public sector employers accountable where Title VII violations have been found, including identifying changes to existing employer practices and policies that will result in safe work environments. More information about the Civil Rights’ Division’s Sexual Harassment in the Workplace Initiative can be found here. “The Justice Department vigorously enforces Title VII to ensure that people can work free from sexual harassment and retaliation,” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The verdict sends the clear message that this Justice Department will continue to effectively combat sex-based discrimination whenever it occurs in a public sector workplace.” Dykes originally filed her sexual harassment charge against the WMD with the Denver Field Office of the Equal Employment Opportunity Commission (EEOC), which investigated and determined that there was reasonable cause to believe that discrimination had occurred and referred the matters to the Department of Justice. More information about Title VII and other federal employment laws is available at the division’s Employment Litigation Section website. The continued enforcement of Title VII is a priority of the Civil Rights Division. Additional information about the Civil Rights Division of the Department of Justice is available on the division website. EEOC enforces federal laws prohibiting employment discrimination. 
Further information about EEOC is available on its website. The United States was represented in this case by Robert Galbreath, Torie Atkinson, Brian McEntire, and Patty Stasco.",2018-03-23T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Employment Litigation Section; USAO - Wyoming


B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- A function plus a list comprehension to execute it takes about 30 seconds on a respectable local machine and about 2 minutes on jhub. If yours takes much longer, check your code for inefficiencies. If you can't fix those, you can take a small random sample of the 717 for partial credit on this part and full credit on the remainder.
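
The `re.sub` "or condition" in the first bullet can be sketched as below, assuming the entity strings have already been extracted. The helper name `strip_entities`, the sample sentence, and the entity list are all hypothetical; the list is hand-written rather than spaCy output. Sorting the entities longest-first is a design choice of this sketch, so that a short entity such as `U.S.` cannot clip part of a longer one.

```python
import re

def strip_entities(text, entity_texts):
    """Remove every occurrence of the given entity strings from text."""
    if not entity_texts:
        return text
    # Longest first, so a short entity cannot clip part of a longer one.
    ordered = sorted(set(entity_texts), key=len, reverse=True)
    # re.escape protects characters such as "." and "$" inside entity text.
    pattern = "|".join(re.escape(ent) for ent in ordered)
    return re.sub(pattern, "", text)

# Made-up sentence and hand-written entity list (not spaCy output).
sentence = "The FBI and the U.S. Postal Inspection Service investigated a $25,000 fraud."
ents = ["FBI", "U.S.", "U.S. Postal Inspection Service", "25,000"]

cleaned = strip_entities(sentence, ents)
print(cleaned)
```

The cleaned string is what would then be passed to `polarity_scores`.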


In [28]:
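## your code here to define function
# Build the analyzer once; constructing it inside the function reloads the lexicon on every call
sent_obj = SentimentIntensityAnalyzer()

def content_sentiment(one_content):
    # Collect the named entities spaCy finds in this press release
    spacy_content = nlp(one_content)
    named_ents = [ent.text for ent in spacy_content.ents]
    # Escape each entity so characters like "." or "$" are matched literally
    regex_ents = r'|'.join(re.escape(ent) for ent in named_ents) 
    no_ents = re.sub(regex_ents, "", one_content)
    # Score the entity-free text; returns the neg/neu/pos/compound dictionary
    sentiment_content = sent_obj.polarity_scores(no_ents)
    return sentiment_content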

content = " The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine. Sentences are imposed by a federal district court judge based upon the U.S. Sentencing Guidelines and other statutory factors.The investigation was conducted by a team that included the FBI; HHS-OIG; FDA Office of Criminal Investigations; the Defense Criminal Investigative Service; the Drug Enforcement Administration; the Department of Labor, Employee Benefits Security Administration; the Office of Personnel Management; the U.S. Postal Inspection Service; the U.S. Postal Service Office of Inspector General; and the Department of Veterans Affairs.']"

content_sentiment(content)
#re.findall(content_sentiment(content), "25,000")

{'neg': 0.194, 'neu': 0.743, 'pos': 0.063, 'compound': -0.875}

In [29]:
## your code here executing the function
doj_subset["sentiment_dictionary"]= doj_subset["contents"].apply(content_sentiment)
doj_subset["sentiment_dictionary"]

77       {'neg': 0.199, 'neu': 0.751, 'pos': 0.049, 'compound': -0.9931}
155      {'neg': 0.133, 'neu': 0.799, 'pos': 0.068, 'compound': -0.9325}
157       {'neg': 0.093, 'neu': 0.83, 'pos': 0.077, 'compound': -0.7579}
162      {'neg': 0.126, 'neu': 0.789, 'pos': 0.085, 'compound': -0.9037}
168      {'neg': 0.178, 'neu': 0.778, 'pos': 0.044, 'compound': -0.9864}
                                      ...                               
13002     {'neg': 0.155, 'neu': 0.78, 'pos': 0.064, 'compound': -0.9689}
13032     {'neg': 0.082, 'neu': 0.813, 'pos': 0.105, 'compound': 0.7003}
13034    {'neg': 0.158, 'neu': 0.754, 'pos': 0.088, 'compound': -0.9648}
13068    {'neg': 0.139, 'neu': 0.762, 'pos': 0.099, 'compound': -0.9798}
13081    {'neg': 0.151, 'neu': 0.819, 'pos': 0.031, 'compound': -0.9921}
Name: sentiment_dictionary, Length: 717, dtype: object

C. Add the four sentiment scores to the `doj_subset` dataframe to create a dataframe: `doj_subset_wscore`. Sort from highest to lowest `neg` score and print the `id`, `contents`, and `neg` columns for the two most negative press releases. 

Notes:

- Don't worry if your sentiment score differs slightly from our output on GitHub; differences in preprocessing can lead to different scores
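One way to turn a column of score dictionaries into four numeric columns is sketched below on a made-up three-row frame (the ids and scores are invented for illustration). The point it shows is index alignment: building the score frame on the same index as the original data lets `pd.concat` line rows up without any `reset_index` bookkeeping.

```python
import pandas as pd

# Toy frame with a non-default index, like a filtered subset would have
toy = pd.DataFrame(
    {
        "id": ["a-1", "b-2", "c-3"],
        "sentiment_dictionary": [
            {"neg": 0.10, "neu": 0.80, "pos": 0.10, "compound": 0.00},
            {"neg": 0.30, "neu": 0.65, "pos": 0.05, "compound": -0.90},
            {"neg": 0.05, "neu": 0.75, "pos": 0.20, "compound": 0.70},
        ],
    },
    index=[77, 155, 157],
)

# One column per dictionary key, built on the same index as the source rows
scores = pd.DataFrame(toy["sentiment_dictionary"].tolist(), index=toy.index)
toy_wscore = pd.concat([toy, scores], axis=1)

# Two most negative rows
top_neg = toy_wscore.sort_values("neg", ascending=False).head(2)
print(top_neg[["id", "neg"]])
```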

In [30]:
## your code here
doj_subset_copy = doj_subset.copy().reset_index()

# json_normalize already returns a fresh 0..n-1 index that lines up with doj_subset_copy;
# calling reset_index() on it would add a second, duplicate "index" column
doj_subset_wscore = pd.concat([doj_subset_copy, 
                               pd.json_normalize(doj_subset_copy["sentiment_dictionary"])], 
                               axis=1)
doj_subset_wscore_sorted = doj_subset_wscore.sort_values(by="neg", ascending=False)
doj_subset_wscore_sorted[["id","contents","neg"]][:2]

Unnamed: 0,id,contents,neg
13,14-248,"The Department of Justice announced that this morning John W. Ng, 58, of Albuquerque, N.M., made his initial appearance in federal court on a criminal complaint charging him with a hate crime offense. This charge is related to anti-Semitic threats Ng made against a Jewish woman who owns and operates the Nosh Jewish Delicatessen and Bakery in Albuquerque. Ng was arrested by the FBI on March 7, 2014, based on a criminal complaint alleging that he interfered with the victim’s federally protected rights by threatening her and interfering with her business because of her religion. According to the criminal complaint, between Jan. 22, 2014, and Feb. 8, 2014, Ng allegedly posted threatening anti-Semitic notes on and in the vicinity of the victim’s business. A criminal complaint merely establishes probable cause, and Ng is presumed innocent unless proven guilty. If convicted on the offense charged in the criminal complaint, Ng faces a maximum statutory penalty of one year in prison. This matter was investigated by the Albuquerque Division of the FBI and is being prosecuted by Assistant U.S. Attorney Mark T. Baker of the U.S. Attorney’s Office for the District of New Mexico and Trial Attorney AeJean Cha of the U.S. Department of Justice’s Civil Rights Division.",0.323
34,13-312,"John Hall, 27, an Aryan Brotherhood member and inmate at the Federal Correctional Institution (FCI) in Seagoville, Texas, was sentenced today by U.S. District Judge Reed O’Connor after pleading guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act stemming from his assault of a fellow inmate, whom he believed to be gay, the Department of Justice announced. Hall assaulted his fellow inmate with a dangerous weapon, causing bodily injury to the victim on Dec. 20, 2011. Hall was sentenced to serve 71 months in prison to be served consecutively with the sentence he is currently serving. The assault occurred on Dec. 20, 2011, inside the FCI Seagoville when Hall targeted and attacked the victim, a fellow inmate, because he believed the victim was gay or involved in a sexual relationship with another male inmate. Hall repeatedly punched, kicked and stomped on the victim’s face with his shod feet, a dangerous weapon, while yelling a homophobic slur. The victim lost consciousness during the assault and suffered multiple lacerations to his face. The victim also sustained a fractured eye socket, lost a tooth, fractured other teeth and was treated at a hospital for the injuries he sustained during Hall’s unprovoked attack. Hall pleaded guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act on Nov. 8, 2012. “Brutality and violence based on sexual orientation has no place in a civilized society,” said Thomas E. Perez, Assistant Attorney General for the Civil Rights Division. “The Justice Department is committed to using all the tools in our law enforcement arsenal, including the Matthew Shepard and James Byrd Jr. 
Hate Crimes Prevention Act, to prosecute acts motivated by hate.” “This prosecution sends a clear message that this office, in partnership with attorneys in the department’s Civil Rights Division, will prioritize and aggressively prosecute hate crimes and others civil rights violations in North Texas,” said U.S. Attorney Sarah R. Saldaña of the Northern District of Texas. This case was investigated by the FBI Dallas Division. The case was prosecuted by Assistant U.S. Attorney Errin Martin and Trial Attorney Adriana Vieco of the Civil Rights Division.",0.303


D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using `groupby` and `agg`.

E. Add a one-sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)
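To make "standardized summary" concrete: as far as I understand the VADER implementation, the compound score is the sum of adjusted word valences squashed into (-1, 1) by x / sqrt(x² + α), with α = 15 by default. The sketch below assumes that formula and constant; `normalize_compound` is an illustrative name, not the library's API, so check the vaderSentiment source before relying on it.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1).

    Assumed to mirror VADER's normalization; alpha=15 is taken to be its default.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# The curve saturates quickly as the raw sum moves away from zero
for raw in [-40, -10, -2, 0, 2, 10]:
    print(raw, round(normalize_compound(raw), 4))
```

Under this assumption, a long press release containing many negatively scored words accumulates a large raw sum and lands very close to -1, which is one reason document-level compound scores cluster near the extremes.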


In [31]:
## agg and find the mean compound score by topic
doj_subset_wscore_sorted.groupby("topics_clean").agg({"compound":'mean'})

## Part E
# Some topics likely use more emotionally charged vocabulary than others. For example, hate crime press releases often
# describe or quote threats and violence, which carry strongly negative sentiment, whereas civil rights press releases tend to use more neutral language

topics_clean,compound
Civil Rights,-0.093931
Hate Crimes,-0.930943
Project Safe Childhood,-0.681391


# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscore` data, which is restricted to civil rights, hate crimes, and project safe childhood and has the sentiment scores added.


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-217 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-718 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here are code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
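The filtering steps listed above can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the expected solution: `TOY_STOPWORDS` is a made-up stand-in for the nltk stopword list plus the custom words, a regex replaces the nltk tokenizer (so a mixed token like `abc123` would yield `abc` here, whereas an `isalpha` check on whole tokens would drop it), and stemming is left out.

```python
import re

# Hypothetical stand-in for nltk stopwords + custom DOJ stopwords
TOY_STOPWORDS = {"this", "with", "were", "that", "justice", "department"}

def light_preprocess(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Lowercase, keep alphabetic tokens of min_len+ characters, drop stopwords."""
    # [a-z]+ on lowercased text discards digits and punctuation in one pass
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [tok for tok in tokens if len(tok) >= min_len and tok not in stopwords]
    # A stemmer call would go here, applied to each kept token
    return " ".join(kept)

sample = "The Justice Department charged 2 officers with assault in 2016."
result = light_preprocess(sample)
print(result)
```

Filtering on the unstemmed token matters: stopword lists contain full words, so checking membership after stemming could miss words whose stem no longer matches the list entry.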

In [32]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

In [33]:
## your code defining a text processing function
# Use a set so each stopword lookup is constant-time
list_stopwords = set(stopwords.words('english') + custom_doj_stopwords)
snowball = SnowballStemmer(language='english')

def preprocess(content):
    content_lower = content.lower()
    # Filter on the unstemmed token (stopword, alphabetic, length), then stem what remains
    content_nostop = [snowball.stem(word) for word in wordpunct_tokenize(content_lower)
                         if word not in list_stopwords 
                          and word.isalpha() and len(word)>=4]
    content_rejoin = " ".join(content_nostop)
    return content_rejoin
    
preprocess(content)

'charg conspiraci violat anti kickback provid sentenc greater five year prison three year supervis releas fine sentenc impos feder court judg base upon sentenc guidelin statutori factor conduct team includ crimin investig defens crimin investig servic drug enforc administr labor employe benefit secur administr personnel manag postal inspect servic postal servic inspector general veteran affair'

In [34]:
## your code executing the function
doj_subset_wscore["processed_text"] = doj_subset_wscore["contents"].apply(preprocess)

In [35]:
## your code showing the examples
doj_subset_wscore[doj_subset_wscore["id"].isin(["16-718","16-217"])][["id", "contents", "processed_text"]]

Unnamed: 0,id,contents,processed_text
313,16-217,"The Justice Department has reached a comprehensive settlement agreement with the city of Miami and the Miami Police Department (MPD) resolving the Justice Department’s investigation of officer-involved shootings by MPD officers, announced Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division and U.S. Attorney Wifredo A. Ferrer of the Southern District of Florida. The settlement, which was approved by Miami’s city commission today and will go into effect when the agreement is signed by all parties, resolves claims stemming from the Justice Department’s investigation into officer-involved shootings by MPD officers, which was conducted under the Violent Crime Control and Law Enforcement Act of 1994. The investigation’s findings, issued in July 2013, identified a pattern or practice of excessive use of force through officer-involved shootings in violation of the Fourth Amendment of the Constitution. The city’s compliance with the settlement will be monitored by an independent reviewer, former Tampa, Florida, Police Chief Jane Castor. Under the settlement agreement, the city will implement comprehensive reforms to ensure constitutional policing and support public trust. The settlement agreement is designed to minimize officer-involved shootings and to more effectively and quickly investigate officer-involved shootings that do occur, through measures that include: “This settlement represents a renewed commitment by the city of Miami and Chief Rodolfo Llanes to provide constitutional policing for Miami residents and to protect public safety through sustainable reform,” said Principal Deputy Assistant Attorney General Gupta. 
“The agreement will help to strengthen the relationship between the MPD and the communities they serve by improving accountability for officers who fire their weapons unlawfully, and provides for community participation in the enforcement of this agreement.” “Today's agreement is the result of a joint effort between the Department of Justice and the City of Miami to ensure that the Miami Police Department continues its efforts to make our community safe while protecting the sacred Constitutional rights of all of our citizens,” said U.S. Attorney Ferrer. “Through oversight and communication, the agreement seeks to make permanent the positive changes that former Chief Orosa and Chief Llanes have made, and we applaud the City Commission’s vote.” The settlement agreement builds upon important reforms implemented by the city since the Justice Department issued its findings, including: The investigation was conducted by attorneys and staff from the Civil Rights Division’s Special Litigation Section and the Civil Division of the U. S. 
Attorney’s Office of the Southern District of Florida.",reach comprehens settlement agreement citi miami miami polic resolv offic involv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem offic involv shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc offic involv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim offic involv shoot effect quick investig offic involv shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff special litig section southern florida
632,16-718,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up. The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground. Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H. The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI’s Jackson Division, with the cooperation of the Mississippi Department of Corrections. It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division’s Criminal Section. 
Marsher Indictment",nine count indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict


## 2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.
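The subset-then-count pattern described above is sketched here on a tiny invented document-term matrix. `top_words` and `META_COLS` are illustrative names, and the 25th/75th percentiles stand in for the 5th/95th because the toy data has only four rows.

```python
import pandas as pd

# Toy document-term matrix: metadata columns first, then word counts
toy_dtm = pd.DataFrame({
    "id": ["d1", "d2", "d3", "d4"],
    "compound": [-0.9, -0.2, 0.4, 0.8],
    "topics_clean": ["A", "A", "B", "B"],
    "victim": [5, 2, 0, 0],
    "settlement": [0, 1, 3, 4],
    "court": [1, 1, 1, 1],
})
META_COLS = ["id", "compound", "topics_clean"]

def top_words(dtm_subset, meta_cols=META_COLS, top_n=2):
    """Sum each word column over the subset and return the largest totals."""
    counts = dtm_subset.drop(columns=meta_cols).sum()
    return counts.sort_values(ascending=False).head(top_n)

# Thresholds come from the full data, then each subset goes through the same helper
hi_cut = toy_dtm["compound"].quantile(0.75)
lo_cut = toy_dtm["compound"].quantile(0.25)
most_positive = top_words(toy_dtm[toy_dtm["compound"] >= hi_cut])
most_negative = top_words(toy_dtm[toy_dtm["compound"] <= lo_cut])
by_topic = {topic: top_words(group) for topic, group in toy_dtm.groupby("topics_clean")}
print(most_positive, most_negative, by_topic, sep="\n\n")
```

Dropping metadata by name rather than by position keeps the helper correct even if the number or order of metadata columns changes.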


In [64]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase=True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), 
                                   columns=vectorizer.get_feature_names_out())  # Updated method
    dtm_dense_named_withid = pd.concat([metadata.reset_index(drop=True), dtm_dense_named], axis=1)
    return dtm_dense_named_withid

In [65]:
# Step A: Generate the Document-Term Matrix 
metadata = doj_subset_wscore[["id", "compound", "topics_clean"]]
dtm = create_dtm(doj_subset_wscore["processed_text"], metadata)

In [1]:
# get_topwords function
def get_topwords(dtm_subset, top_n=10):
    # The first 3 columns are metadata (id, compound, topics_clean); the rest are word counts
    word_counts = dtm_subset.iloc[:, 3:].sum().sort_values(ascending=False)
    return word_counts.head(top_n)

In [3]:
# Step B: Print the top 10 words for press releases with compound sentiment in the top 5%
# Drop duplicated column names first: a vocabulary term (e.g. "compound") can share a name
# with a metadata column, which makes dtm["compound"] ambiguous
duplicates = dtm.columns.duplicated(keep='first')
dtm = dtm.loc[:, ~duplicates]
# Keep rows at or above the 95th percentile of compound sentiment
dtm_top_positive = dtm[dtm["compound"] >= dtm["compound"].quantile(0.95)].reset_index(drop=True)
get_topwords(dtm_top_positive)

In [127]:
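# Step C: Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)
# Make sure column names are unique before filtering (no-op if already deduplicated)
dtm = dtm.loc[:, ~dtm.columns.duplicated(keep='first')]
# Keep rows at or below the 5th percentile of compound sentiment
dtm_top_negative = dtm[dtm["compound"] <= dtm["compound"].quantile(0.05)].reset_index(drop=True)
get_topwords(dtm_top_negative)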

In [100]:
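# Step D: Print the top 10 words for press releases in each of the three `topics_clean`
for topic in dtm["topics_clean"].unique():
    dtm_topic_subset = dtm[dtm["topics_clean"] == topic]
    # Print explicitly: a bare expression inside a loop may not be displayed by the notebook
    print(get_topwords(dtm_topic_subset), end="\n\n")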

offic        637
hous         633
discrimin    616
enforc       544
disabl       532
said         497
feder        479
violat       477
state        452
court        414
dtype: int64

child          1022
exploit         701
sexual          572
safe            479
childhood       474
project         472
pornographi     452
children        423
crimin          405
prosecut        374
dtype: int64

victim      591
crime       557
hate        524
defend      484
prosecut    478
charg       463
sentenc     455
american    451
feder       432
guilti      430
dtype: int64

## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, set `minimum_probability=0` so that all 3 topic probabilities are returned. Write an assert statement to check that the length of the list equals the number of rows in the `doj_subset_wscore` dataframe

B. Add the topic probabilities to the `doj_subset_wscore` dataframe as columns and create a column, `top_topic`, that assigns each document to its highest-probability topic (eg topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crimes, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crimes has 246 documents; if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple of press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map more cleanly onto an estimated topic than others
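Parts B and C can be sketched on invented probabilities as follows. The column names `topic_1` to `topic_3` and all the numbers are made up; the sketch only shows how `idxmax` picks the highest-probability topic per row and how `pd.crosstab` with `normalize="index"` turns counts into within-label shares.

```python
import pandas as pd

toy = pd.DataFrame({
    "topics_clean": ["Hate Crimes", "Hate Crimes",
                     "Civil Rights", "Civil Rights", "Civil Rights", "Civil Rights"],
    "topic_1": [0.8, 0.3, 0.1, 0.6, 0.2, 0.3],
    "topic_2": [0.1, 0.6, 0.7, 0.3, 0.7, 0.6],
    "topic_3": [0.1, 0.1, 0.2, 0.1, 0.1, 0.1],
})
prob_cols = ["topic_1", "topic_2", "topic_3"]

# idxmax(axis=1) returns, for each row, the name of the column holding the largest value
toy["top_topic"] = toy[prob_cols].idxmax(axis=1)

# normalize="index" divides each row of the count table by its row total
shares = pd.crosstab(toy["topics_clean"], toy["top_topic"], normalize="index") * 100
print(shares.round(1))
```

A topic that is never the top topic for any document (here `topic_3`) simply does not appear as a column in the crosstab.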


In [116]:
## your code here to get doc-level topic probabilities 
# minimum_probability=0 keeps every topic in the output, even those with near-zero probability
topic_probs = [lda_model.get_document_topics(doc, minimum_probability=0) for doc in corpus]
assert len(topic_probs) == len(doj_subset_wscore), "The number of topic probability lists should match the number of rows in doj_subset_wscore."

In [117]:
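## your code here to add those topic probabilities to the dataframe
topic_cols = ["Topic_1_Prob", "Topic_2_Prob", "Topic_3_Prob"]
# Each entry of topic_probs is a list of (topic_id, probability) pairs; keep the probabilities
topic_df = pd.DataFrame([[prob[1] for prob in doc_probs] for doc_probs in topic_probs],
                        columns=topic_cols)
# Drop topic columns left over from an earlier run so re-running this cell does not duplicate them
doj_subset_wscore = doj_subset_wscore.drop(columns=topic_cols, errors="ignore")
doj_subset_wscore = pd.concat([doj_subset_wscore.reset_index(drop=True), topic_df], axis=1)
doj_subset_wscore["top_topic"] = topic_df.idxmax(axis=1)
doj_subset_wscore.head()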

Unnamed: 0,index,id,title,contents,date,topics_clean,components_clean,sentiment_dictionary,index.1,neg,...,Topic_1_Prob,Topic_2_Prob,Topic_3_Prob,processed_text_bigrams,Topic_1_Prob.1,Topic_2_Prob.1,Topic_3_Prob.1,Topic_1_Prob.2,Topic_2_Prob.2,Topic_3_Prob.2
0,77,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. 
Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle","{'neg': 0.199, 'neu': 0.751, 'pos': 0.049, 'compound': -0.9931}",0,0.199,...,0.018474,0.978801,0.002725,former_supervisori supervisori_correct correct_offic offic_louisiana louisiana_state state_penitentiari penitentiari_angola angola_louisiana louisiana_plead plead_guilti guilti_yesterday yesterday_connect connect_beat beat_handcuf handcuf_shackl shackl_inmat inmat_addit addit_conspir conspir_cover cover_misconduct misconduct_falsifi falsifi_offici offici_record record_lie lie_intern intern_investig investig_happen happen_jame jame_savoy savoy_marksvill marksvill_louisiana louisiana_admit admit_plea plea_hear hear_wit wit_offic offic_use use_excess excess_forc forc_inmat inmat_fail fail_interven interven_conspir conspir_offic offic_cover cover_beat beat_engag engag_varieti varieti_obstruct obstruct_act act_person person_falsifi falsifi_offici offici_prison prison_record record_cover cover_attack attack_scotti scotti_kennedi kennedi_beeb beeb_arkansa arkansa_john john_sander sander_marksvill marksvill_louisiana louisiana_previous previous_plead plead_guilti guilti_novemb novemb_septemb septemb_role role_beat beat_cover cover_everi everi_citizen citizen_right right_process process_protect protect_unreason unreason_forc forc_correct correct_offic offic_violat violat_basic basic_constitut constitut_must must_held held_account account_egregi egregi_action action_said said_act act_general general_john john_gore gore_continu continu_vigor vigor_prosecut prosecut_correct correct_offic offic_violat violat_public public_trust trust_commit commit_crime crime_cover cover_violat violat_feder feder_crimin crimin_yesterday yesterday_anoth anoth_exampl exampl_unwav unwav_commit commit_pursu pursu_violat violat_feder feder_crimin 
crimin_law law_said said_act act_unit unit_state state_middl middl_louisiana louisiana_corey corey_amundson amundson_continu continu_work work_close close_ensur ensur_investig investig_baton baton_roug roug_resid resid_agenc agenc_prosecut prosecut_frederick frederick_menner menner_middl middl_louisiana louisiana_christoph christoph_perra perra_crimin crimin_section,0.017223,0.980052,0.002725,0.016957,0.980318,0.002725
1,155,15-1522,Alabama Man Found Guilty of Aggravated Sexual Abuse of a Child,"A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced. According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army. Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. U.S. Army Criminal Investigations Division and the FBI’s Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. 
For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2015-12-11T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern","{'neg': 0.133, 'neu': 0.799, 'pos': 0.068, 'compound': -0.9325}",1,0.133,...,0.003262,0.003026,0.993712,feder_juri juri_convict convict_rick rick_evan evan_anniston anniston_alabama alabama_today today_aggrav aggrav_sexual sexual_abus abus_child child_five five_general general_lesli lesli_caldwel caldwel_crimin crimin_joyc joyc_white white_vanc vanc_northern northern_alabama alabama_announc announc_accord accord_evid evid_introduc introduc_evan evan_former former_armi armi_soldier soldier_wife wife_defens defens_employe employe_resid resid_germani germani_ask ask_take take_temporari temporari_custodi custodi_five five_year year_child child_whose whose_parent parent_deploy deploy_iraq iraq_armi armi_evan evan_sexual sexual_abus abus_child child_multipl multipl_occas occas_month month_child child_live live_decemb decemb_austin austin_berri berri_crimin crimin_child child_exploit exploit_obscen obscen_section section_ceo ceo_jacquelyn jacquelyn_hutzel hutzel_northern northern_alabama alabama_prosecut prosecut_armi armi_crimin crimin_investig investig_birmingham birmingham_alabama alabama_investig investig_brought brought_part part_project project_safe safe_childhood childhood_nationwid nationwid_initi initi_combat combat_grow grow_epidem epidem_child child_sexual sexual_exploit exploit_abus abus_launch launch_attorney attorney_offic offic_ceo ceo_project project_safe safe_childhood childhood_marshal marshal_feder feder_state state_local local_resourc resourc_better better_locat locat_apprehend apprehend_prosecut prosecut_individu individu_exploit exploit_children children_internet internet_well well_identifi identifi_rescu rescu_victim victim_inform inform_project project_safe safe_childhood childhood_pleas 
pleas_visit,0.003262,0.003026,0.993712,0.003265,0.003026,0.993709
2,157,16-213,Alabama Man Indicted on Child Pornography and Sex Tourism Charges,"An Alabama native was indicted today and charged with multiple crimes involving travel with intent to engage in illicit sexual conduct with minors and child pornography, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Kenyen R. Brown of the Southern District of Alabama. Clarence Edward Evers Jr., aka Bud, a technology teacher employed by the Conecuh County, Alabama, Board of Education, was arrested on Feb. 11, 2016, and was charged today with five counts of travel with intent to engage in illicit sexual conduct with a minor, one count of attempted travel with intent to engage in illicit sexual conduct with a minor, one count of production and attempted production of child pornography, one count of transportation of child pornography, one count of receipt of child pornography, one count of access with intent to view child pornography and one count of possession of child pornography. According to the indictment, Evers allegedly traveled to Thailand in the summers of 2010 through 2014 for the purpose of engaging in illicit sexual conduct with a minor and allegedly attempted to make a similar trip in the spring of 2015. During the 2014 trip, Evers also allegedly photographed his victims’ abuse and then transported the images back to the United States. In addition, Evers allegedly had other images of child sexual exploitation on his computers and other electronic devices. The charges contained in the indictment are only allegations. Evers is presumed innocent unless and until he is proven guilty beyond a reasonable doubt in a court of law. ICE-HSI is investigating this case. Trial Attorney James E. Burke IV of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Sean P. Costello and Maria E. Murphy of the Southern District of Alabama are prosecuting the case. 
This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-02-24T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Southern","{'neg': 0.093, 'neu': 0.83, 'pos': 0.077, 'compound': -0.7579}",2,0.093,...,0.00196,0.001855,0.996185,alabama_nativ nativ_indict indict_today today_charg charg_multipl multipl_crime crime_involv involv_travel travel_intent intent_engag engag_illicit illicit_sexual sexual_conduct conduct_minor minor_child child_pornographi pornographi_announc announc_general general_lesli lesli_caldwel caldwel_crimin crimin_kenyen kenyen_brown brown_southern southern_alabama alabama_clarenc clarenc_edward edward_ever ever_technolog technolog_teacher teacher_employ employ_conecuh conecuh_counti counti_alabama alabama_board board_educ educ_arrest arrest_charg charg_today today_five five_count count_travel travel_intent intent_engag engag_illicit illicit_sexual sexual_conduct conduct_minor minor_count count_attempt attempt_travel travel_intent intent_engag engag_illicit illicit_sexual sexual_conduct conduct_minor minor_count count_product product_attempt attempt_product product_child child_pornographi pornographi_count count_transport transport_child child_pornographi pornographi_count count_receipt receipt_child child_pornographi pornographi_count count_access access_intent intent_view view_child child_pornographi pornographi_count count_possess possess_child child_pornographi pornographi_accord accord_indict 
indict_ever ever_alleg alleg_travel travel_thailand thailand_summer summer_purpos purpos_engag engag_illicit illicit_sexual sexual_conduct conduct_minor minor_alleg alleg_attempt attempt_make make_similar similar_trip trip_spring spring_trip trip_ever ever_also also_alleg alleg_photograph photograph_victim victim_abus abus_transport transport_imag imag_back back_unit unit_state state_addit addit_ever ever_alleg alleg_imag imag_child child_sexual sexual_exploit exploit_comput comput_electron electron_devic devic_charg charg_contain contain_indict indict_alleg alleg_ever ever_presum presum_innoc innoc_unless unless_proven proven_guilti guilti_beyond beyond_reason reason_doubt doubt_court court_investig investig_jame jame_burk burk_crimin crimin_child child_exploit exploit_obscen obscen_section section_ceo ceo_attorney attorney_sean sean_costello costello_maria maria_murphi murphi_southern southern_alabama alabama_prosecut prosecut_brought brought_part part_project project_safe safe_childhood childhood_nationwid nationwid_initi initi_combat combat_grow grow_epidem epidem_child child_sexual sexual_exploit exploit_abus abus_launch launch_attorney attorney_offic offic_ceo ceo_project project_safe safe_childhood childhood_marshal marshal_feder feder_state state_local local_resourc resourc_better better_locat locat_apprehend apprehend_prosecut prosecut_individu individu_exploit exploit_children children_internet internet_well well_identifi identifi_rescu rescu_victim victim_inform inform_project project_safe safe_childhood childhood_pleas pleas_visit,0.001958,0.001855,0.996187,0.001958,0.001855,0.996187
3,162,16-381,Alabama Man Indicted for Producing Child Pornography Involving Multiple Victims,"An Alabama man was indicted today by a federal grand jury in Birmingham, Alabama, on charges related to the production of child pornography involving four minor victims, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Joyce White Vance of the Northern District of Alabama. Gregory Jerome Lee, 53, formerly of Cullman County, Alabama, was indicted on four counts of production of child pornography, one count of conspiracy to advertise child pornography and one count of conspiracy to distribute and receive child pornography. According to the indictment, from September 1996 through December 2004, Lee used, persuaded, coerced and enticed minors to engage in sexually explicit conduct in order to produce images of that conduct. Between September 1996 and August 2007, Lee conspired with other individuals to distribute and receive child pornography through a variety of means, including the Internet. The U.S. Postal Inspection Service (USPIS) is investigating the case. Trial Attorney Amy E. Larson of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. The charges and allegations contained in an indictment are merely accusations. The defendant is presumed innocent unless and until proven guilty. Members of the public who may have information related to this matter should call the USPIS Birmingham Office at (205) 326-2909. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. 
Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-30T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern","{'neg': 0.126, 'neu': 0.789, 'pos': 0.085, 'compound': -0.9037}",3,0.126,...,0.020199,0.00234,0.977461,alabama_indict indict_today today_feder feder_grand grand_juri juri_birmingham birmingham_alabama alabama_charg charg_relat relat_product product_child child_pornographi pornographi_involv involv_four four_minor minor_victim victim_announc announc_general general_lesli lesli_caldwel caldwel_crimin crimin_joyc joyc_white white_vanc vanc_northern northern_alabama alabama_gregori gregori_jerom jerom_former former_cullman cullman_counti counti_alabama alabama_indict indict_four four_count count_product product_child child_pornographi pornographi_count count_conspiraci conspiraci_advertis advertis_child child_pornographi pornographi_count count_conspiraci conspiraci_distribut distribut_receiv receiv_child child_pornographi pornographi_accord accord_indict indict_septemb septemb_decemb decemb_use use_persuad persuad_coerc coerc_entic entic_minor minor_engag engag_sexual sexual_explicit explicit_conduct conduct_order order_produc produc_imag imag_conduct conduct_septemb septemb_august august_conspir conspir_individu individu_distribut distribut_receiv receiv_child child_pornographi pornographi_varieti varieti_mean mean_includ includ_internet internet_postal postal_inspect inspect_servic servic_uspi uspi_investig investig_larson larson_crimin crimin_child child_exploit exploit_obscen obscen_section section_ceo ceo_jacquelyn jacquelyn_hutzel hutzel_northern northern_alabama alabama_prosecut prosecut_charg 
charg_alleg alleg_contain contain_indict indict_mere mere_accus accus_defend defend_presum presum_innoc innoc_unless unless_proven proven_guilti guilti_member member_public public_inform inform_relat relat_matter matter_call call_uspi uspi_birmingham birmingham_brought brought_part part_project project_safe safe_childhood childhood_nationwid nationwid_initi initi_combat combat_grow grow_epidem epidem_child child_sexual sexual_exploit exploit_abus abus_launch launch_attorney attorney_offic offic_ceo ceo_project project_safe safe_childhood childhood_marshal marshal_feder feder_state state_local local_resourc resourc_better better_locat locat_apprehend apprehend_prosecut prosecut_individu individu_exploit exploit_children children_internet internet_well well_identifi identifi_rescu rescu_victim victim_inform inform_project project_safe safe_childhood childhood_pleas pleas_visit,0.014402,0.00234,0.983257,0.01802,0.00234,0.979639
4,168,14-464,Alabama Man Indicted for Threatening African-American Man and Another Person at Restaurant,"Jeremy Heath Higgins was indicted for threatening an African-American man at a Quinton, Alabama, restaurant, and for threatening another person who ordered Higgins to leave the restaurant due to his behavior, Acting Assistant Attorney General Jocelyn Samuels for the Justice Department’s Civil Rights Division and U.S. Attorney Joyce Vance for the Northern District of Alabama announced today. Higgins, 28, was charged in a three count indictment returned yesterday by a federal grand jury in the U.S. District Court for the Northern District of Alabama. The indictment charges him with one felony count and two misdemeanor counts of interference with a federally-protected activity. The indictment alleges that on June 14, 2013, Higgins approached and threatened an African-American man at the Alabama Rose Steakhouse because the man was present at the restaurant with a white woman. According to the indictment, another person ordered Higgins to leave the premises of the restaurant because of Higgins’ behavior toward the African-American man, after which Higgins allegedly shouted a threat to burn down the restaurant. The indictment further alleges that Higgins threatened the person who had ordered him to leave the restaurant by painting graffiti on the restaurant’s exterior and fence. If convicted of the felony count of the indictment, Higgins could face a maximum sentence of 10 years in prison and a $250,000 fine. For each of the misdemeanor charges, Higgins could face a maximum sentence of one year in prison and a $200,000 fine. This case is being investigated by the FBI and is being prosecuted by Assistant U.S. Attorney Robin B. Mark of the Northern District of Alabama and Trial Attorney David Reese of the Justice Department’s Civil Rights Division. 
An indictment is merely an accusation, and the defendant is presumed innocent unless proven guilty.",2014-05-01T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,"{'neg': 0.178, 'neu': 0.778, 'pos': 0.044, 'compound': -0.9864}",4,0.178,...,0.687335,0.00296,0.309705,jeremi_heath heath_higgin higgin_indict indict_threaten threaten_african african_american american_quinton quinton_alabama alabama_restaur restaur_threaten threaten_anoth anoth_person person_order order_higgin higgin_leav leav_restaur restaur_behavior behavior_act act_general general_jocelyn jocelyn_samuel samuel_joyc joyc_vanc vanc_northern northern_alabama alabama_announc announc_today today_higgin higgin_charg charg_three three_count count_indict indict_return return_yesterday yesterday_feder feder_grand grand_juri juri_court court_northern northern_alabama alabama_indict indict_charg charg_feloni feloni_count count_misdemeanor misdemeanor_count count_interfer interfer_feder feder_protect protect_activ activ_indict indict_alleg alleg_june june_higgin higgin_approach approach_threaten threaten_african african_american american_alabama alabama_rose rose_steakhous steakhous_present present_restaur restaur_white white_woman woman_accord accord_indict indict_anoth anoth_person person_order order_higgin higgin_leav leav_premis premis_restaur restaur_higgin higgin_behavior behavior_toward toward_african african_american american_higgin higgin_alleg alleg_shout shout_threat threat_burn burn_restaur restaur_indict indict_alleg alleg_higgin higgin_threaten threaten_person person_order order_leav leav_restaur restaur_paint paint_graffiti graffiti_restaur restaur_exterior exterior_fenc fenc_convict convict_feloni feloni_count count_indict indict_higgin higgin_could could_face face_maximum maximum_sentenc sentenc_year year_prison prison_fine fine_misdemeanor misdemeanor_charg charg_higgin higgin_could could_face face_maximum maximum_sentenc sentenc_year year_prison prison_fine 
fine_investig investig_prosecut prosecut_robin robin_mark mark_northern northern_alabama alabama_david david_rees rees_indict indict_mere mere_accus accus_defend defend_presum presum_innoc innoc_unless unless_proven proven_guilti,0.708263,0.00296,0.288778,0.691473,0.002965,0.305562


In [118]:
## your code here to summarize the topic proportions for each of the topics_clean 
topic_breakdown = pd.crosstab(doj_subset_wscore["topics_clean"], doj_subset_wscore["top_topic"], normalize="index") * 100
topic_breakdown

topics_clean,Topic_1_Prob,Topic_2_Prob,Topic_3_Prob
Civil Rights,49.180328,28.852459,21.967213
Hate Crimes,23.170732,72.357724,4.471545
Project Safe Childhood,1.204819,0.0,98.795181


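Press releases labeled "Civil Rights" are spread across all three estimated topics (about 49%, 29%, and 22%), whereas "Hate Crimes" releases concentrate in Topic 2 (72%) and "Project Safe Childhood" releases fall almost entirely in Topic 3 (99%). A likely reason is that Civil Rights releases cover a broader range of issues, from voting rights to discrimination in housing or employment, and so use a more diverse vocabulary. That variety makes these documents map less distinctly to a single topic, because different aspects of civil rights can overlap with themes from the other manual labels or estimated topics.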
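
To make the row percentages above easier to read, here is a minimal sketch of how `pd.crosstab` with `normalize="index"` behaves on a toy table. The rows are invented for illustration and are not drawn from the DOJ data; only the column names mirror the ones used in this problem set.

```python
import pandas as pd

# Toy data (hypothetical): one row per press release, with its manual label
# and the estimated topic that received the highest probability.
toy = pd.DataFrame({
    "topics_clean": ["Civil Rights", "Civil Rights", "Civil Rights", "Civil Rights",
                     "Hate Crimes", "Hate Crimes"],
    "top_topic": ["Topic_1_Prob", "Topic_1_Prob", "Topic_2_Prob", "Topic_3_Prob",
                  "Topic_2_Prob", "Topic_2_Prob"],
})

# normalize="index" divides each row by its row total, so every manual label
# sums to 100 after scaling to percentages.
toy_breakdown = pd.crosstab(toy["topics_clean"], toy["top_topic"], normalize="index") * 100
print(toy_breakdown.round(1))
```

Because each row sums to 100, the table answers "given a manual label, how are its releases distributed over estimated topics?" rather than the reverse question.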

# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words).

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each consecutive pair of words into a bigram separated by an underscore. E.g.:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured like the example above.
 
**Hint**: there are many ways to solve this, but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigrams` columns for the press release with id = 16-217

In [125]:
## your code here 
def create_bigram_onedoc(text):
    words = text.split() 
    bigrams = [f"{w1}_{w2}" for w1, w2 in zip(words, words[1:])]
    return " ".join(bigrams)  

doj_subset_wscore["processed_text_bigrams"] = doj_subset_wscore["processed_text"].apply(create_bigram_onedoc)
doj_subset_wscore[doj_subset_wscore["id"] == "16-217"][["id", "processed_text", "processed_text_bigrams"]]

Unnamed: 0,id,processed_text,processed_text_bigrams
313,16-217,reach comprehens settlement agreement citi miami miami polic resolv offic involv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem offic involv shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc offic involv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim offic involv shoot effect quick investig offic involv shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff special litig section southern florida,reach_comprehens comprehens_settlement settlement_agreement agreement_citi citi_miami miami_miami miami_polic polic_resolv resolv_offic offic_involv involv_shoot shoot_offic offic_announc announc_princip princip_deputi deputi_general general_vanita vanita_gupta gupta_head head_wifredo wifredo_ferrer ferrer_southern southern_florida florida_settlement settlement_approv approv_miami miami_citi citi_commiss commiss_today today_effect effect_agreement agreement_sign sign_parti parti_resolv resolv_claim claim_stem stem_offic offic_involv 
involv_shoot shoot_offic offic_conduct conduct_violent violent_crime crime_control control_enforc enforc_find find_issu issu_juli juli_identifi identifi_pattern pattern_practic practic_excess excess_forc forc_offic offic_involv involv_shoot shoot_violat violat_fourth fourth_amend amend_constitut constitut_citi citi_complianc complianc_settlement settlement_monitor monitor_independ independ_review review_former former_tampa tampa_florida florida_polic polic_chief chief_jane jane_castor castor_settlement settlement_agreement agreement_citi citi_implement implement_comprehens comprehens_reform reform_ensur ensur_constitut constitut_polic polic_support support_public public_trust trust_settlement settlement_agreement agreement_design design_minim minim_offic offic_involv involv_shoot shoot_effect effect_quick quick_investig investig_offic offic_involv involv_shoot shoot_occur occur_measur measur_includ includ_settlement settlement_repres repres_renew renew_commit commit_citi citi_miami miami_chief chief_rodolfo rodolfo_llane llane_provid provid_constitut constitut_polic polic_miami miami_resid resid_protect protect_public public_safeti safeti_sustain sustain_reform reform_said said_princip princip_deputi deputi_general general_gupta gupta_agreement agreement_help help_strengthen strengthen_relationship relationship_communiti communiti_serv serv_improv improv_account account_offic offic_fire fire_weapon weapon_unlaw unlaw_provid provid_communiti communiti_particip particip_enforc enforc_agreement agreement_today today_agreement agreement_result result_joint joint_effort effort_citi citi_miami miami_ensur ensur_miami miami_polic polic_continu continu_effort effort_make make_communiti communiti_safe safe_protect protect_sacr sacr_constitut constitut_citizen citizen_said said_ferrer ferrer_oversight oversight_communic communic_agreement agreement_seek seek_make make_perman perman_posit posit_chang chang_former former_chief chief_orosa orosa_chief chief_llane llane_made 
made_applaud applaud_citi citi_commiss commiss_vote vote_settlement settlement_agreement agreement_build build_upon upon_import import_reform reform_implement implement_citi citi_sinc sinc_issu issu_find find_includ includ_conduct conduct_attorney attorney_staff staff_special special_litig litig_section section_southern southern_florida


C. Use the `create_dtm` function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound`

D. Print (1) the dimensions of the `dtm` matrix from question 2.2 and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix

E. Find and print the 10 most prevalent bigrams for each of the three `topics_clean` categories using the `get_topwords` function from 2.2

In [124]:
# your code here
#C
metadata = doj_subset_wscore[["id", "topics_clean", "compound"]]
dtm_bigram = create_dtm(doj_subset_wscore["processed_text_bigrams"], metadata)

#D
print("dtm shape:",dtm.shape)
print("dtm_bigram.shape:", dtm_bigram.shape)
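# Both matrices have the same number of rows (one per document), but the bigram matrix has far more columns:
# each column is a distinct term, and the same words combine into many more distinct ordered pairs than there are individual words.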

#E
for topic in dtm_bigram["topics_clean"].unique():
    dtm_topic_subset = dtm_bigram[dtm_bigram["topics_clean"] == topic]
    get_topwords(dtm_topic_subset)

dtm shape: (717, 6869)
dtm_bigram.shape: (717, 72720)


fair_hous         231
deputi_general    221
princip_deputi    221
vanita_gupta      202
gupta_head        200
general_vanita    199
said_princip      186
unit_state        156
nation_origin     143
consent_decre     128
dtype: int64

safe_childhood       474
project_safe         472
child_pornographi    450
child_exploit        281
sexual_exploit       223
exploit_children     200
plead_guilti         197
exploit_obscen       176
obscen_section       175
child_sexual         174
dtype: int64

hate_crime          379
african_american    367
plead_guilti        275
year_prison         161
special_agent       124
racial_motiv        114
thoma_perez         111
grand_juri          101
perez_general        95
said_thoma           91
dtype: int64
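
The jump in column count reported above (6869 for `dtm` versus 72720 for `dtm_bigram`) can be illustrated on a tiny corpus. This is a sketch under stated assumptions: the three documents are invented, and the vocabulary is counted directly with Python sets rather than with the `create_dtm` helper from 2.2.

```python
def create_bigram_onedoc(text):
    # Pair each word with its successor and join the pair with an underscore
    words = text.split()
    return " ".join(f"{w1}_{w2}" for w1, w2 in zip(words, words[1:]))

# Toy corpus (hypothetical): the same four stems in different orders
toy_docs = [
    "child exploit section",
    "exploit child victim",
    "section child victim",
]

# Each distinct term would become one column of a document-term matrix
unigram_vocab = {w for doc in toy_docs for w in doc.split()}
bigram_vocab = {b for doc in toy_docs for b in create_bigram_onedoc(doc).split()}

print("unique unigrams:", len(unigram_vocab))
print("unique bigrams:", len(bigram_vocab))
```

Even with only four distinct stems, reordering them yields five distinct bigrams; on a real corpus with thousands of stems the gap grows much wider, which is consistent with the shapes printed above.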

# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach this is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total). Do they seem like different stages of the same crime or just press releases covering similar crimes?

In [79]:
# your code here 
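
One possible approach, sketched here on toy data rather than on `doj_subset`, is cosine similarity between bag-of-words count vectors, which compares whole documents instead of short strings. The release ids and stemmed texts below are hypothetical, and `cosine_sim` is an illustrative helper, not a function defined elsewhere in this problem set.

```python
from collections import Counter
from itertools import combinations
import math

def cosine_sim(counts_a, counts_b):
    # Cosine similarity between two word-count vectors stored as Counters
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical ids mapped to already-processed (stemmed) text
toy_releases = {
    "A-1": "higgin indict threaten restaur alabama",
    "A-2": "higgin sentenc threaten restaur alabama prison",
    "B-1": "settlement agreement citi miami polic",
    "B-2": "fair hous settlement consent decre",
}

counts = {rid: Counter(text.split()) for rid, text in toy_releases.items()}

# Score every unordered pair of releases, then keep the two most similar pairs
pairs = [(cosine_sim(counts[a], counts[b]), a, b) for a, b in combinations(counts, 2)]
top_pairs = sorted(pairs, reverse=True)[:2]
for score, a, b in top_pairs:
    print(f"{a} vs {b}: {score:.3f}")
```

On the full data, a vectorized version (for example scikit-learn's `TfidfVectorizer` combined with `cosine_similarity`) would likely scale better than this pure-Python loop. Reading the top-scoring pairs by hand is still needed to judge whether they describe different stages of the same case or simply similar crimes.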