<a href="https://colab.research.google.com/github/Tyriek-cloud/Wikipedia-Article-Analyzer-NLP-Streamlit-App/blob/main/Wikipedia_Article_Analyzer_NLP_Streamlit_App.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wikipedia Article Analyzer NLP Streamlit App

This project scrapes articles from Wikipedia. By default, the Streamlit application displays the Wikipedia Statistics page (https://en.wikipedia.org/wiki/Statistics), and the user can enter their own Wikipedia URL instead. The end goal is to use Natural Language Processing to summarize each article, with an emphasis on important and easily digestible details, and to display the article's important images and graphics.


In [None]:
# A few libraries need to be installed for this project to run smoothly
# The other libraries are already available and only need to be imported

!pip install nltk
!pip install beautifulsoup4
!pip install -q streamlit
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.1.2-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.1.2


In [None]:
# Now we will import the necessary (and/or potentially useful) libraries for this process
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import spacy
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
import heapq
import streamlit as st
from pyngrok import ngrok

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Load the spaCy model and download the NLTK stopwords so that both are available in the notebook
nlp = spacy.load("en_core_web_sm")
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# This function produces a simple summary by keeping the first num_sentences sentences (10 by default)
def summarize_text(text, num_sentences=10):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    summary = " ".join(sentences[:num_sentences])
    return summary
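
The function above keeps the leading sentences of the article. The notebook also imports `heapq` and the NLTK stopwords without using them, which suggests a frequency-scored summary as a possible next step. The following is a minimal sketch of that idea using only the standard library; the function name `summarize_by_frequency`, the regex-based sentence split, and the tiny `STOPWORDS` set are assumptions for illustration, not part of the original project.

```python
import heapq
import re
from collections import Counter

# Small illustrative stopword set (an assumption; NLTK's English list is much larger)
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it",
             "that", "this", "for", "on", "as", "by", "with", "be", "or"}

def summarize_by_frequency(text, num_sentences=3):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count how often each non-stopword occurs in the whole text
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(indexed_sentence):
        # A sentence's score is the summed frequency of its non-stopwords
        _, sentence = indexed_sentence
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOPWORDS)

    # Pick the top-scoring sentences, then restore document order by index
    top = heapq.nlargest(num_sentences, enumerate(sentences), key=score)
    return " ".join(sentence for _, sentence in sorted(top))
```

Summing frequencies favors long sentences, so dividing the score by the sentence length may be worth trying on real articles.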

In [None]:
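# Function to extract link targets (hrefs) from the references section
# The links are returned as found, so they may be relative and are not checked for availability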
def extract_live_urls(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    references_section = soup.find("span", {"id": "References"})
    if references_section:
        references = references_section.find_next("ul")
        if references:
            urls = [a['href'] for a in references.find_all('a', href=True)]
            return urls

    return []
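
The hrefs collected above can be site-relative paths such as `/wiki/ISBN_(identifier)`. One possible post-processing step, sketched below with the standard library only, resolves each href against the article URL and keeps just the links that point away from the article's own host. The helper name `external_reference_urls` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urljoin, urlparse

def external_reference_urls(page_url, hrefs):
    """Resolve hrefs against page_url and keep only links that leave the page's host."""
    page_host = urlparse(page_url).netloc
    resolved = []
    for href in hrefs:
        # urljoin handles relative paths, fragments and protocol-relative links
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc != page_host:
            if absolute not in resolved:  # drop duplicates, keep first-seen order
                resolved.append(absolute)
    return resolved
```

It could be applied to the result of `extract_live_urls(url)`, for example `external_reference_urls(url, extract_live_urls(url))`.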

In [None]:
# Now to define the Main function to allow BeautifulSoup to parse the data
def main():
    statistics_url = "https://en.wikipedia.org/wiki/Statistics"
    response = requests.get(statistics_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    content = soup.find("div", class_="mw-parser-output")
    paragraphs = content.find_all("p")

    # This will extract the text and print out the summary
    full_text = "\n".join([p.text for p in paragraphs])
    summary = summarize_text(full_text)

    print("Summary:")
    print(summary)

# Summary Generation

In [None]:
# This pulls live URLs from the references section at the bottom of Wikipedia articles
statistics_url = "https://en.wikipedia.org/wiki/Statistics"
live_urls = extract_live_urls(statistics_url)
print("\nLive URLs from References:")
for url in live_urls:
    print(url)

if __name__ == "__main__":
    main()


Live URLs from References:
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/978-0134705217
/wiki/ISBN_(identifier)
/wiki/Special:BookSources/0702172863
Summary:
Statistics (from German: Statistik, orig. "description of a state, a country")[1][2] is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.[3][4][5] In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.[6]

 When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusi

In [None]:
# Now to pull in the urls for all of the images in the Wikipedia articles
def extract_images(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

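    # Finds all the image tags that have a src attribute
    # (tags without one would otherwise raise a KeyError below)
    img_tags = soup.find_all('img', src=True)

    # Extracts all the image URLs, resolved against the page URL
    image_urls = [urljoin(url, img['src']) for img in img_tags]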

    return image_urls
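
`extract_images` returns every image on the page, including site logos and small icons. Since the stated goal is to show the important images, one possible filter is sketched below. It assumes that Wikimedia thumbnail URLs embed the rendered width as `<N>px-` in the file name, as in the `upload.wikimedia.org` URLs this notebook prints. The helper name `filter_content_images` and the `min_width=100` threshold are hypothetical choices.

```python
import re

# Thumbnail URLs on upload.wikimedia.org embed the rendered width, e.g. ".../290px-Name.png"
THUMB_WIDTH = re.compile(r"/(\d+)px-[^/]+$")

def filter_content_images(image_urls, min_width=100):
    """Keep Wikimedia-hosted thumbnails that are at least min_width pixels wide."""
    kept = []
    for url in image_urls:
        if "upload.wikimedia.org" not in url:
            continue  # skip site chrome such as logos under /static/images/
        match = THUMB_WIDTH.search(url)
        if match and int(match.group(1)) >= min_width:
            kept.append(url)
    return kept
```

Images whose URL carries no width marker are dropped by this sketch, so full-size originals would need separate handling.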

# Pulling in Image URLs

In [None]:
# Calling in the main function for image extraction
def main():
    statistics_url = "https://en.wikipedia.org/wiki/Statistics"
    image_urls = extract_images(statistics_url)

    # Print the image URLs into a neat list
    for idx, url in enumerate(image_urls, start=1):
        print(f"Image {idx}: {url}")

if __name__ == "__main__":
    main()

Image 1: https://en.wikipedia.org/static/images/icons/wikipedia.png
Image 2: https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg
Image 3: https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg
Image 4: https://upload.wikimedia.org/wikipedia/commons/thumb/4/40/Fisher_iris_versicolor_sepalwidth.svg/100px-Fisher_iris_versicolor_sepalwidth.svg.png
Image 5: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Nuvola_apps_edu_mathematics_blue-p.svg/16px-Nuvola_apps_edu_mathematics_blue-p.svg.png
Image 6: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Nuvola_apps_edu_mathematics_blue-p.svg/20px-Nuvola_apps_edu_mathematics_blue-p.svg.png
Image 7: https://upload.wikimedia.org/wikipedia/commons/thumb/4/44/Standard_Normal_Distribution.png/290px-Standard_Normal_Distribution.png
Image 8: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Iris_Pairs_Plot.svg/290px-Iris_Pairs_Plot.svg.png
Image 9: https://upload.wikimedia.

Now we will build the actual file for the web application (app.py). The code may need slight modification before it can be used in production.

Note: Depending on the structure of the Wikipedia article, some sections of the Streamlit app may not load. For example, the references list stays empty when the page has no element with the id "References".
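
Because the app accepts a URL typed in by the user, it may help to check that the input looks like a Wikipedia article before fetching it. The following is a rough sketch using only the standard library. The function name `is_wikipedia_article_url` is hypothetical, and the rule it applies (an http or https URL on a `wikipedia.org` host with a path under `/wiki/`) is an assumption about what counts as an article URL.

```python
from urllib.parse import urlparse

def is_wikipedia_article_url(url):
    """Return True for http(s) URLs on a wikipedia.org host whose path is under /wiki/."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    return (
        parsed.scheme in ("http", "https")
        and (host == "wikipedia.org" or host.endswith(".wikipedia.org"))
        and parsed.path.startswith("/wiki/")
        and len(parsed.path) > len("/wiki/")  # require an article title after /wiki/
    )
```

In app.py, this check could run before `requests.get(url_input)`, with a call such as `st.warning(...)` shown when it returns False.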

In [67]:
%%writefile app.py
# Now to define the Streamlit app (with some modifications)
# Note: the %%writefile cell magic has to be the first line of the cell
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import spacy
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
import heapq
import streamlit as st
from pyngrok import ngrok

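# Load the spaCy model, downloading it only if it is not installed yet
# (Streamlit reruns this script on every interaction, so an unconditional download is wasteful)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")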

# Summarize text by keeping the first num_sentences sentences
def summarize_text(text, num_sentences=10):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    summary = " ".join(sentences[:num_sentences])
    return summary

# Extract link targets (hrefs) from the references section; they may be relative
def extract_live_urls(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    references_section = soup.find("span", {"id": "References"})
    if references_section:
        references = references_section.find_next("ul")
        if references:
            urls = [a['href'] for a in references.find_all('a', href=True)]
            return urls

    return []

# Extracts images
def extract_images(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Finds all image tags that have a src attribute
    img_tags = soup.find_all('img', src=True)

    # Extracts image URLs, resolved against the page URL
    image_urls = [urljoin(url, img['src']) for img in img_tags]

    return image_urls

# Main function
def main():
    st.title("Wikipedia Article Analyzer")

    # Sidebar for user input (if any)
    st.sidebar.header("User Input")
    url_input = st.sidebar.text_input("Enter Wikipedia URL:", "https://en.wikipedia.org/wiki/Statistics")

    # Main content
    if st.button("Analyze"):
        # Extract and summarize the text
        response = requests.get(url_input)
        soup = BeautifulSoup(response.text, 'html.parser')
        content = soup.find("div", class_="mw-parser-output")
        # Fall back to the whole page if the article body container is missing
        paragraphs = (content or soup).find_all("p")
        full_text = "\n".join([p.text for p in paragraphs])
        summary = summarize_text(full_text)

        # Display summary
        st.subheader("Summary:")
        st.write(summary)

        # Extract live URLs from the references section
        live_urls = extract_live_urls(url_input)
        st.subheader("Live URLs from References:")
        for url in live_urls:
            st.write(url)

        # Extract images
        image_urls = extract_images(url_input)
        st.subheader("Image URLs:")
        for url in image_urls:
            st.write(url)

if __name__ == "__main__":
    main()

Overwriting app.py


In [None]:
# This prints the public IP address of the Colab VM, which localtunnel asks for as the tunnel password
! wget -q -O - ipv4.icanhazip.com

35.236.237.93


In [68]:
# Runs Streamlit
! streamlit run app.py & npx localtunnel --port 8501

[..................] | fetchMetadata: sill resolveWithNewModule localtunnel@2.0
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.12:8501
  External URL: http://35.236.237.93:8501

npx: installed 22 in 5.557s
your url is: https://clean-flowers-press.loca.lt
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 64.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
[nltk_data] Downloa