In [1]:
# import Beautiful Soup and urllib for web scraping
import bs4 as bs
import urllib.request
import re

In [2]:
scraped_data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Chi-squared_test")
article = scraped_data.read()
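
The fetch above leaves the response open and sets no timeout. A minimal sketch of a more defensive variant follows; the helper name `fetch_html`, the 10-second timeout, and the User-Agent string are assumptions made for illustration and are not part of this notebook.

```python
import urllib.request


def fetch_html(url, timeout=10.0, user_agent="summarizer-sketch/0.1"):
    """Return the raw bytes served at `url`.

    `fetch_html` and the User-Agent value are hypothetical names
    chosen for this sketch.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # The context manager closes the response even if read() raises.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_html("https://en.wikipedia.org/wiki/Chi-squared_test")` should return the same bytes as `scraped_data.read()` above, assuming the server accepts the custom User-Agent.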

In [3]:
len(article)

158404

In [4]:
parsed_article = bs.BeautifulSoup(article,'lxml')
parsed_article


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Chi-squared test - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2ba43e63-dc0d-4bfe-9378-0ef5f2fd9e15","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Chi-squared_test","wgTitle":"Chi-squared test","wgCurRevisionId":983024096,"wgRevisionId":983024096,"wgArticleId":226680,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with incomplete citations","Articles with incomplete citations from January 2013","Statistical tests for contingency tables","Nonparametric statisti

In [5]:
len(parsed_article)

2

In [6]:
paragraphs = parsed_article.find_all('p')

In [7]:
article_text = ""

In [8]:
for p in paragraphs:
    article_text += p.text
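
To make explicit what "collect the text of every `<p>` tag" means, here is a dependency-free sketch that uses the standard library's `html.parser` instead of Beautiful Soup. It is a swapped-in technique for illustration only; the class `ParagraphText` and the function `paragraph_text` are hypothetical names, and Beautiful Soup handles malformed or nested markup more robustly than this.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer))
            self._in_p = False

    def handle_data(self, data):
        # Text of inline children such as <b> is kept; text outside <p> is dropped.
        if self._in_p:
            self._buffer.append(data)


def paragraph_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    # Join once at the end rather than growing a string inside the loop.
    return "".join(parser.paragraphs)


sample = "<h1>Title</h1><p>First <b>bold</b> one.</p><div>skip</div><p>Second.</p>"
text = paragraph_text(sample)
```

As in the loop above, paragraphs are concatenated with no separator, so sentence boundaries depend on the trailing newline or space that the source paragraphs happen to carry.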

In [9]:
# this is our final scraped text data 
article_text

'A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson\'s chi-squared test and variants thereof. Pearson\'s chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. \nIn the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a χ2 frequency distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.\nTest statistics that follow a χ2 distribution occur when the observations are independent and normally distributed, wh

In [10]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [11]:
stop_words = list(STOP_WORDS)
print(stop_words)

['other', 'into', 'on', 'cannot', 'both', 'much', 'be', 'them', '‘m', 'rather', 'anywhere', 'too', 'hence', 'n‘t', 'noone', 'anything', 'someone', 'hundred', 'why', 'ca', 'would', 'can', 'already', 'thereby', 'sixty', 'whence', 'me', 'used', 'three', 'namely', 'yours', 'last', 'each', 'take', 'had', 'must', 'could', 'meanwhile', 'first', 'thence', "'ve", 'across', 'done', 'again', 'the', 'may', 'back', 'various', 'whose', 'these', 'out', 'any', 'whom', 'made', 'neither', 'others', 'two', 'most', 'those', 'towards', 'forty', 'never', 'about', 'doing', 'thereupon', 'therefore', 'up', 'name', 'hereafter', 'everywhere', 'should', 'a', 'among', 'do', '‘ll', 'front', 'amount', 'beyond', 'herein', 'former', 'yourselves', 'seemed', 'many', 'eleven', 'using', 'sometimes', 'thus', 'under', 'off', "'ll", 'eight', 'it', 'own', 'anyone', 'therein', 'did', 'from', 'call', 'perhaps', 'will', 'either', 'formerly', 'next', 'am', 'nine', 'whenever', '’ll', 'without', 'empty', 'who', 'five', 'are', 'is',

In [12]:
#creation of spacy object
nlp = spacy.load('en_core_web_sm')

In [13]:
#loading the text data to use it for spacy
doc = nlp(article_text)

In [14]:
# tokenize the document into words
tokens = [token.text for token in doc]
print(tokens)

['A', 'chi', '-', 'squared', 'test', ',', 'also', 'written', 'as', 'χ2', 'test', ',', 'is', 'a', 'statistical', 'hypothesis', 'test', 'that', 'is', 'valid', 'to', 'perform', 'when', 'the', 'test', 'statistic', 'is', 'chi', '-', 'squared', 'distributed', 'under', 'the', 'null', 'hypothesis', ',', 'specifically', 'Pearson', "'s", 'chi', '-', 'squared', 'test', 'and', 'variants', 'thereof', '.', 'Pearson', "'s", 'chi', '-', 'squared', 'test', 'is', 'used', 'to', 'determine', 'whether', 'there', 'is', 'a', 'statistically', 'significant', 'difference', 'between', 'the', 'expected', 'frequencies', 'and', 'the', 'observed', 'frequencies', 'in', 'one', 'or', 'more', 'categories', 'of', 'a', 'contingency', 'table', '.', '\n', 'In', 'the', 'standard', 'applications', 'of', 'this', 'test', ',', 'the', 'observations', 'are', 'classified', 'into', 'mutually', 'exclusive', 'classes', '.', 'If', 'the', 'null', 'hypothesis', 'that', 'there', 'are', 'no', 'differences', 'between', 'the', 'classes', 'in

In [15]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
# add "\n" to the punctuation string so newline tokens are filtered out too
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [17]:
# build the word frequency table
# note: the filters compare lower-cased text, but the keys keep the original
# casing, so 'Test' and 'test' end up as separate entries
word_frequencies = {}
for word in doc:
    text = word.text
    # skip stop words and punctuation (including "\n")
    if text.lower() in stop_words or text.lower() in punctuation:
        continue
    # skip purely numeric tokens
    if text.isnumeric():
        continue
    word_frequencies[text] = word_frequencies.get(text, 0) + 1
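
The table above keys on the original casing while filtering on lower-cased text. A minimal sketch of a case-consistent variant, using `collections.Counter` on a plain list of token strings, is shown below. The function `count_words` and the tiny token list are assumptions for illustration; it also tests punctuation by set membership rather than by the substring check used above.

```python
from collections import Counter
from string import punctuation


def count_words(tokens, stop_words):
    """Count tokens case-insensitively, skipping stop words, punctuation, and numbers."""
    skip = set(stop_words)
    punct = set(punctuation) | {"\n"}
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key in skip or key in punct or key.isnumeric():
            continue
        # Store the lower-cased key so 'Test' and 'test' share one entry.
        counts[key] += 1
    return counts


tokens = ["Test", "statistics", "follow", "a", "test", ",", "the", "test", "2", "\n"]
counts = count_words(tokens, stop_words={"a", "the"})
```

With lower-cased keys, a later lookup by `word.text.lower()` finds every counted word, including proper nouns.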
                

In [18]:
word_frequencies

{'chi': 19,
 'squared': 20,
 'test': 32,
 'written': 1,
 'χ2': 9,
 'statistical': 4,
 'hypothesis': 12,
 'valid': 2,
 'perform': 1,
 'statistic': 9,
 'distributed': 5,
 'null': 10,
 'specifically': 1,
 'Pearson': 15,
 'variants': 1,
 'thereof': 1,
 'determine': 2,
 'statistically': 1,
 'significant': 2,
 'difference': 3,
 'expected': 7,
 'frequencies': 4,
 'observed': 5,
 'categories': 2,
 'contingency': 2,
 'table': 3,
 'standard': 1,
 'applications': 2,
 'observations': 9,
 'classified': 2,
 'mutually': 2,
 'exclusive': 2,
 'classes': 3,
 'differences': 1,
 'population': 6,
 'true': 5,
 'computed': 1,
 'follows': 3,
 'frequency': 1,
 'distribution': 25,
 'purpose': 1,
 'evaluate': 1,
 'likely': 1,
 'assuming': 2,
 'Test': 1,
 'statistics': 1,
 'follow': 1,
 'occur': 1,
 'independent': 2,
 'normally': 3,
 'assumptions': 1,
 'justified': 1,
 'central': 1,
 'limit': 2,
 'theorem': 1,
 'tests': 5,
 'testing': 2,
 'independence': 5,
 'pair': 1,
 'random': 3,
 'variables': 1,
 'based': 2,


In [19]:
max_frequency = max(word_frequencies.values())
max_frequency

32

In [20]:
max_key = max(word_frequencies,key = word_frequencies.get)
max_key

'test'

In [21]:
# divide each count by the maximum count to get a weighted frequency in (0, 1]
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
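
As a worked example of this normalization, the sketch below applies it to four counts taken from the table above, where the maximum is 32. The helper name `normalize_by_max` is hypothetical, and unlike the in-place loop it returns a new dictionary.

```python
def normalize_by_max(counts):
    """Scale every count by the largest count, so the top word gets weight 1.0."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: count / peak for word, count in counts.items()}


weights = normalize_by_max({"test": 32, "distribution": 25, "chi": 19, "written": 1})
# 'test' -> 32/32, 'distribution' -> 25/32, 'chi' -> 19/32, 'written' -> 1/32
```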

In [22]:
word_frequencies

{'chi': 0.59375,
 'squared': 0.625,
 'test': 1.0,
 'written': 0.03125,
 'χ2': 0.28125,
 'statistical': 0.125,
 'hypothesis': 0.375,
 'valid': 0.0625,
 'perform': 0.03125,
 'statistic': 0.28125,
 'distributed': 0.15625,
 'null': 0.3125,
 'specifically': 0.03125,
 'Pearson': 0.46875,
 'variants': 0.03125,
 'thereof': 0.03125,
 'determine': 0.0625,
 'statistically': 0.03125,
 'significant': 0.0625,
 'difference': 0.09375,
 'expected': 0.21875,
 'frequencies': 0.125,
 'observed': 0.15625,
 'categories': 0.0625,
 'contingency': 0.0625,
 'table': 0.09375,
 'standard': 0.03125,
 'applications': 0.0625,
 'observations': 0.28125,
 'classified': 0.0625,
 'mutually': 0.0625,
 'exclusive': 0.0625,
 'classes': 0.09375,
 'differences': 0.03125,
 'population': 0.1875,
 'true': 0.15625,
 'computed': 0.03125,
 'follows': 0.09375,
 'frequency': 0.03125,
 'distribution': 0.78125,
 'purpose': 0.03125,
 'evaluate': 0.03125,
 'likely': 0.03125,
 'assuming': 0.0625,
 'Test': 0.03125,
 'statistics': 0.03125,


In [23]:
import pandas as pd
df = pd.DataFrame(list(word_frequencies.items()), columns = ['Word','Frequency'])
df

           Word  Frequency
0           chi    0.59375
1       squared    0.62500
2          test    1.00000
3       written    0.03125
4            χ2    0.28125
..          ...        ...
351   belonging    0.03125
352     disease    0.03125
353   essential    0.03125
354  chromosome    0.03125


In [24]:
# split the document into sentences (spaCy Span objects)
sentence_token = list(doc.sents)
print(sentence_token)

[A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof., Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. 
, In the standard applications of this test, the observations are classified into mutually exclusive classes., If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a χ2 frequency distribution., The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.
, Test statistics that follow a χ2 distribution occur when the observations are independent and normally distributed,

In [25]:
# score each sentence by summing the weighted frequencies of its words
# note: the lookup is lower-cased while word_frequencies keeps the original
# casing, so a word stored only in capitalized form (e.g. 'Pearson') adds nothing
sentence_score = {}

for sentence in sentence_token:
    for word in sentence:
        key = word.text.lower()
        if key in word_frequencies:
            sentence_score[sentence] = sentence_score.get(sentence, 0) + word_frequencies[key]
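
The scoring step can be sketched without spaCy as a sum of word weights per sentence. In the sketch below, sentences are plain lists of token strings, the weights are keyed by lower-cased words, and scores are keyed by sentence index; `score_sentences` and the sample data are assumptions for illustration, not this notebook's objects.

```python
def score_sentences(sentences, weights):
    """Return {sentence index: sum of the weights of its tokens}.

    `sentences` is a list of token lists; `weights` maps lower-cased
    words to weighted frequencies.
    """
    scores = {}
    for index, tokens in enumerate(sentences):
        total = sum(weights.get(tok.lower(), 0.0) for tok in tokens)
        # Like the loop above, sentences with no weighted word get no entry.
        if total > 0:
            scores[index] = total
    return scores


weights = {"test": 1.0, "pearson": 0.5, "null": 0.25}
sentences = [
    ["Pearson", "'s", "test"],
    ["The", "null", "test", "test"],
    ["Nothing", "here"],
]
scores = score_sentences(sentences, weights)
```

Because the score is a plain sum, longer sentences tend to score higher; dividing by the sentence length would be one way to offset that, at the cost of favoring very short sentences.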

In [26]:
sentence_score

{A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof.: 10.78125,
 Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. : 3.3125,
 In the standard applications of this test, the observations are classified into mutually exclusive classes.: 1.65625,
 If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a χ2 frequency distribution.: 3.9375,
 The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.: 2.28125,
 Test statistics that follow a χ2 distribution occur when the observa

In [27]:
#to select n-largest weighted sentence from the text data
from heapq import nlargest

In [28]:
# keep 20% of the sentences (by count, rounded down) as the summary
summary_length = int(len(sentence_token)*0.2)

In [29]:
summary_length

11

Thus, we extract only the top 11 sentences, ranked by their weighted-frequency scores, to form the summary.

In [30]:
summary = nlargest(summary_length,sentence_score,key = sentence_score.get)
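
`nlargest` returns the selected sentences in descending score order, not in the order they appear in the article. If reading order matters, one option is to select by sentence index and sort the indices afterwards. The sketch below assumes scores keyed by index; `select_summary`, the `ratio` parameter, and the floor of one sentence are hypothetical choices, not part of this notebook.

```python
from heapq import nlargest


def select_summary(sentences, scores, ratio=0.2):
    """Pick the top `ratio` share of sentences by score, in document order.

    `scores` maps a sentence index to its score.
    """
    # Assumption: always keep at least one sentence, even for short texts.
    k = max(1, int(len(sentences) * ratio))
    top_indices = nlargest(k, scores, key=scores.get)
    # Sorting the indices restores the original reading order.
    return [sentences[i] for i in sorted(top_indices)]


sentences = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]
scores = {0: 3.0, 1: 0.5, 4: 9.0, 7: 1.0, 9: 2.0}
picked = select_summary(sentences, scores)
```

Here `s4` has the highest score, but `s0` is listed first because it comes earlier in the text.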

In [31]:
summary

[A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof.,
 Chi-squared tests often refers to tests for which the distribution of the test statistic approaches the χ2 distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase.,
 Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom are
 If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.,
 Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial frequencies in th

In [32]:
final_summary = [sentence.text for sentence in summary]

In [33]:
final_summary

["A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof.",
 'Chi-squared tests often refers to tests for which the distribution of the test statistic approaches the χ2 distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase.\n',
 'Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom are\nIf the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.\n',
 "Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial freque

In [34]:
final_summary = " ".join(final_summary)

In [35]:
final_summary

"A chi-squared test, also written as χ2 test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Chi-squared tests often refers to tests for which the distribution of the test statistic approaches the χ2 distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase.\n Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom are\nIf the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.\n Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial frequencies in the 

In [36]:
len(article_text)

8863

In [37]:
len(final_summary)

2686
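
The 20% target applies to the number of sentences, not to characters. A quick check of the character-level ratio from the two lengths above:

```python
article_chars = 8863   # len(article_text) reported above
summary_chars = 2686   # len(final_summary) reported above

ratio = summary_chars / article_chars
print(f"{ratio:.1%}")  # prints 30.3%
```

The summary keeps about 30% of the characters, which is plausibly because each score is a sum over words, so the selected sentences tend to be longer than average.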