# Scraping Wikipedia's page about Machine Learning

Let's scrape this Wikipedia page [https://en.wikipedia.org/wiki/Machine_learning](https://en.wikipedia.org/wiki/Machine_learning) to recover all the paragraphs! This is a very common task in data science, especially in the field of Natural Language Processing (NLP) where one needs lots of text samples.

1. Create a class `WikipediaSpider(scrapy.Spider)` with `start_urls = ['https://en.wikipedia.org/wiki/Machine_learning']`. In this class, define a `parse(self, response)` method that recovers all the paragraphs. Then, declare a `CrawlerProcess` that will store the results in a file named `"wikipedia-machine_learning.json"`.

In [1]:
!pip install Scrapy



In [2]:
# Import libraries
import os 
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

In [3]:
# Create your class WikipediaSpider(scrapy.Spider)
class WikipediaSpider(scrapy.Spider):
    # Name of your spider
    name = "wiki"

    # Url to start your spider from 
    start_urls = [
        'https://en.wikipedia.org/wiki/Machine_learning',
    ]

    # Callback function called with the response of each URL in start_urls
    # It yields the text fragments of every <p> inside <div class="mw-parser-output">
    def parse(self, response):
        text_block = response.css('div.mw-parser-output p')
        for text in text_block:
            yield {
                'text': text.css('::text').getall()
            }

In [4]:
# Define the file wikipedia-machine_learning.json, check if it exists, declare 
# the settings and start crawling!
filename = "wikipedia-machine_learning.json"

# If file already exists, delete it before crawling (because Scrapy will 
# concatenate the last and new results otherwise)
if not os.path.exists('./saving'):
    os.mkdir('./saving')
if filename in os.listdir('saving/'):
    os.remove('saving/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'saving/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(WikipediaSpider)
process.start()

2020-12-12 18:57:20 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2020-12-12 18:57:20 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Oct  8 2020, 12:12:24) - [GCC 8.4.0], pyOpenSSL 20.0.0 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
2020-12-12 18:57:20 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-12-12 18:57:20 [scrapy.extensions.telnet] INFO: Telnet Password: 5c3d3a6e3b64cf7d
2020-12-12 18:57:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-12-12 18:57:20 [scrapy.middleware] INFO: Enabled downloader midd
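
As an alternative to deleting the previous output file by hand, Scrapy 2.4 and later accept an `overwrite` key in each `FEEDS` entry. The sketch below only builds the settings dictionary; it assumes the same `saving/` folder and filename as above, and the dictionary would be passed to `CrawlerProcess(settings=...)` as before.

```python
import logging

filename = "wikipedia-machine_learning.json"

# Settings sketch: "overwrite": True asks Scrapy to replace the existing file
# instead of appending to it (feed option available from Scrapy 2.4)
settings = {
    "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "LOG_LEVEL": logging.INFO,
    "FEEDS": {
        "saving/" + filename: {"format": "json", "overwrite": True},
    },
}

print(settings["FEEDS"])
```

With this option the manual `os.remove` step becomes unnecessary, although keeping it does no harm.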

2. Use pandas to read the file "wikipedia-machine_learning.json" and store the data into a DataFrame. Display the first five rows of the dataset.

In [5]:
# Import here
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [6]:
# Read the file
df = pd.read_json('./saving/wikipedia-machine_learning.json')

2020-12-12 18:57:20 [numexpr.utils] INFO: NumExpr defaulting to 2 threads.


In [7]:
# Display the first 5 rows
df.head()

Unnamed: 0,text
0,"[Machine learning, (, ML, ) is the study of computer algorithms that improve automatically through experience., [1], It is seen as a subset of , artificial intelligence, . Machine learning algorithms build a model based on sample data, known as "", training data, "", in order to make predictions or decisions without being explicitly programmed to do so., [2], Machine learning algorithms are used in a wide variety of applications, such as , email filtering, and , computer vision, , where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.\n]"
1,"[A subset of machine learning is closely related to , computational statistics, , which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of , mathematical optimization, delivers methods, theory and application domains to the field of machine learning. , Data mining, is a related field of study, focusing on , exploratory data analysis, through , unsupervised learning, ., [4], [5], In its application across business problems, machine learning is also referred to as , predictive analytics, .\n]"
2,"[Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step., [6], \n]"
3,"[The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers. For example, to train a system for the task of digital character recognition, the , MNIST, dataset of handwritten digits has often been used., [6], \n]"
4,"[\nMachine learning approaches are traditionally divided into three broad categories, depending on the nature of the ""signal"" or ""feedback"" available to the learning system:\n]"


As you can see, each row of the dataset contains a list of `str` objects. In the next steps, we will apply some preprocessing to clean each paragraph and then store it as a single `str` object.

3. Display the list representing one paragraph (here, the row at index 72):

In [8]:
df['text'][72]

['The evolvement of AI systems raises a lot questions in the realm of ethics and morality. AI can be well equipped in making decisions in certain fields such technical and scientific which rely\nheavily on data and historical information. These decisions rely on objectivity and logical reasoning.',
 '[111]',
 ' Because human languages contain biases, machines trained on language ',
 'corpora',
 ' will necessarily also learn these biases.',
 '[112]',
 '[113]',
 '\n']

Notice that some elements in the list have the format `"[...]"`. These elements are reference links, which we are not interested in.
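
One way to recognise such markers is a regular expression that must match the whole fragment. This is a minimal sketch using only the standard library; `is_reference` and `REFERENCE_PATTERN` are hypothetical names introduced for illustration, and the sample list is a shortened copy of the row shown above.

```python
import re

# Matches a fragment made only of a bracketed marker, e.g. "[111]" or "[a]"
REFERENCE_PATTERN = re.compile(r"\[[^\[\]]*\]")


def is_reference(fragment):
    """Return True when the fragment is nothing but a bracketed marker."""
    return REFERENCE_PATTERN.fullmatch(fragment.strip()) is not None


fragments = [
    "These decisions rely on objectivity and logical reasoning.",
    "[111]",
    " Because human languages contain biases, machines trained on language ",
    "corpora",
    " will necessarily also learn these biases.",
    "[112]",
    "[113]",
    "\n",
]

# Keep only the fragments that are not reference markers
cleaned = [fragment for fragment in fragments if not is_reference(fragment)]
print(cleaned)
```

Using `fullmatch` rather than a plain substring test means that ordinary sentences which merely contain a bracket are kept.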

4. Define a function `remove_references(text_list)` that removes all the references from a given list. Then use the pandas `.apply()` method to remove the references from the list in each row of the DataFrame. Check that it works by displaying the same paragraph once again:

In [9]:
# Define remove_references function here
def remove_references(text_list):
    # A reference marker is a fragment that starts with "[" and ends with "]", e.g. "[111]"
    return [
        text for text in text_list
        if not (text.startswith("[") and text.endswith("]"))
    ]

# Apply the function to the list stored in each row of the DataFrame
df['text'] = df['text'].apply(remove_references)

In [10]:
# Sanity check: look at the same row as in the previous question
df['text'][72]

['The evolvement of AI systems raises a lot questions in the realm of ethics and morality. AI can be well equipped in making decisions in certain fields such technical and scientific which rely\nheavily on data and historical information. These decisions rely on objectivity and logical reasoning.',
 ' Because human languages contain biases, machines trained on language ',
 'corpora',
 ' will necessarily also learn these biases.',
 '\n']

5. Have a look at the pandas <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods" target="_blank">str module</a> and find a method that creates a single string from a list of strings by concatenating all the elements. Use this method to change your dataset such that each row contains a single `str` object with the whole paragraph.
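
One candidate from that module is `Series.str.join`, which joins the elements of each list in a column. The sketch below is an illustration on a made-up two-row DataFrame (`toy` is a hypothetical name), not the scraped dataset.

```python
import pandas as pd

# Toy data: each row holds a list of text fragments, like the scraped paragraphs
toy = pd.DataFrame({
    "text": [
        ["Machine learning", " (", "ML", ") is a field of study.", "\n"],
        ["Data mining", " is a related field."],
    ]
})

# Series.str.join concatenates the elements of each list with the given separator
toy["text"] = toy["text"].str.join("")

# Series.str.strip removes leading and trailing whitespace such as "\n"
toy["text"] = toy["text"].str.strip()

print(toy["text"].tolist())
```

This handles the whole column in one call, whereas Python's built-in `"".join(...)` works on one list at a time.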

In [11]:
methode_to_use = 2

# FIRST METHOD: concatenate the fragments one by one
if methode_to_use == 1:
    print("Methode 1 used: ")
    new_string = ""
    for text in df['text'][0]:
        new_string += text
    # .at sets a single cell without chained assignment
    df.at[0, 'text'] = new_string
    print(df.at[0, 'text'])

# SECOND METHOD: let str.join concatenate the fragments
elif methode_to_use == 2:
    print("Methode 2 used: ")
    df.at[0, 'text'] = "".join(df.at[0, 'text'])
    print(df.at[0, 'text'])

else:
    print("No methode is used!")


Methode 2 used: 
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.



6. Loop over all the dataset's rows and print them in order to check that you get all the cleaned paragraphs:

In [12]:
methode_to_use = 2

# FIRST METHOD: concatenate the fragments one by one
if methode_to_use == 1:
    print("Methode 1 used: ")
    for paragraphe, index in zip(df['text'], df.index):
        new_string = ""
        for text in paragraphe:
            new_string += text
        # .at sets a single cell without chained assignment
        df.at[index, 'text'] = new_string
# SECOND METHOD: let str.join concatenate the fragments
elif methode_to_use == 2:
    print("Methode 2 used: ")
    for paragraphe, index in zip(df['text'], df.index):
        # Join the fragments, then replace newlines with spaces
        df.at[index, 'text'] = "".join(paragraphe).replace("\n", " ")
else:
    print("No methode is used!")


# Display the modified DataFrame
display(df)

Methode 2 used: 


Unnamed: 0,text
0,"Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as ""training data"", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks."
1,"A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."
2,"Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step."
3,"The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers. For example, to train a system for the task of digital character recognition, the MNIST dataset of handwritten digits has often been used."
4,"Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the ""signal"" or ""feedback"" available to the learning system:"
...,...
71,"Machine learning poses a host of ethical questions. Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices. For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants. Responsible collection of data and documentation of algorithmic rules used by a system thus is a critical part of machine learning."
72,"The evolvement of AI systems raises a lot questions in the realm of ethics and morality. AI can be well equipped in making decisions in certain fields such technical and scientific which rely heavily on data and historical information. These decisions rely on objectivity and logical reasoning. Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases."
73,"Other forms of ethical challenges, not related to personal biases, are more seen in health care. There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States where there is a long-standing ethical dilemma of improving health care, but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these ""greed"" biases are addressed."
74,"Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units. By 2019, graphic processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI. OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months."
