# Wikipedia Scraper

This notebook finds a person's English Wikipedia page, first by searching Google and then by falling back to a direct Wikipedia URL, and extracts the article text from that page.

## How It Works

1. **Input**: Enter a person's name
2. **Google Search**: Creates a Google search query for the person + "Wikipedia"
3. **URL Extraction**: Finds the Wikipedia URL from search results or constructs it directly
4. **Content Scraping**: Fetches and parses the actual Wikipedia page
5. **Output**: Saves content to a text file and displays statistics
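
Steps 2 and 3 can be sketched with the standard library alone. The helper names `build_search_url` and `direct_wikipedia_url` are hypothetical and do not appear in the notebook; the URL patterns are the ones the notebook uses.

```python
from urllib.parse import quote, quote_plus


def build_search_url(name: str) -> str:
    """Return a Google search URL for '<name> Wikipedia' (step 2)."""
    # quote_plus turns spaces into '+' and percent-encodes reserved characters.
    return "https://www.google.com/search?q=" + quote_plus(f"{name.strip()} Wikipedia")


def direct_wikipedia_url(name: str) -> str:
    """Return the article URL guessed directly from the name (step 3 fallback)."""
    # Wikipedia titles use underscores in place of spaces.
    title = name.strip().replace(" ", "_")
    return "https://en.wikipedia.org/wiki/" + quote(title)


print(build_search_url("Virat kohli"))
print(direct_wikipedia_url("Virat kohli"))
```

Encoding the query rather than only replacing spaces keeps names containing characters such as `&` or `#` from breaking the URL.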

## Features

- **Robust URL Detection**: Two methods to find the Wikipedia page (parsing Google result links, then constructing the URL directly)
- **Proper Content Extraction**: Scrapes from the actual Wikipedia page (not Google results)
- **Content Filtering**: Focuses on the main article content
- **File Output**: Saves scraped content to a text file
- **Statistics**: Shows character count, word count, and sentence count
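
The statistics in the last item can be computed as in the following minimal sketch. The function name `text_stats` is hypothetical; the counting rules match those used in the final cell of the notebook.

```python
def text_stats(text: str) -> dict:
    """Return character, word, and approximate sentence counts for text."""
    return {
        "characters": len(text),
        # Words are whitespace-separated tokens.
        "words": len(text.split()),
        # Counting terminators overestimates when the text contains
        # abbreviations or decimal numbers, so this is only approximate.
        "sentences": sum(text.count(mark) for mark in ".!?"),
    }


print(text_stats("Hello world. How are you? Fine!"))
```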

## Usage

Run all cells below and enter a person's name when prompted. The scraper then locates their Wikipedia page and extracts the content.

---

In [1]:
# Import the libraries
import requests
from bs4 import BeautifulSoup

In [2]:
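# Build the Google search URL
from urllib.parse import quote_plus

inp = input("Enter person's name: ").strip()

# quote_plus encodes spaces as '+' and escapes characters such as '&' or '#'
link = 'https://www.google.com/search?q=' + quote_plus(f"{inp} Wikipedia")
print(link)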

https://www.google.com/search?q=Virat+kohli+Wikipedia


In [3]:
# Send the request to the search URL (the 10-second timeout is an arbitrary choice)
response = requests.get(link, timeout=10)
response

<Response [200]>

In [4]:
soup  = BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>
<html lang="en-IN"><head><title>Google Search</title><style>body{background-color:#fff}</style></head><body><noscript><style>table,div,span,p{display:none}</style><meta content="0;url=/httpservice/retry/enablejs?sei=7oxzaKzFCLyE4-EPytyZ-Q8" http-equiv="refresh"/><div style="display:block">Please click <a href="/httpservice/retry/enablejs?sei=7oxzaKzFCLyE4-EPytyZ-Q8">here</a> if you are not redirected within a few seconds.</div></noscript><script nonce="CKK9X3rNtWEnKXG58_SWLw">(function(){var sctm=false;(function(){sctm&&google.tick("load","pbsst");}).call(this);})();</script><script nonce="CKK9X3rNtWEnKXG58_SWLw" src="//www.google.com/js/bg/YepF6H9Jy0TSx9wszDTCKm28JbhMG6AHuJbvQ2mzd74.js"></script><script nonce="CKK9X3rNtWEnKXG58_SWLw">(function(){var r='1';var ce=30;var sctm=false;var p='sbcAbuWQq71x1V1iYdGCOH81/NxKi/8bB8Q0Djh+JS6zl5NdEUPnm8GbGvh8y0oHoMn4D36lm7MKMtlwO8fHcA/sPzD1HYm/XAAY0ZzTpl8s5dUmzZmIxJ2B2H/DDeFkRoMq4fEBCNDjehemyDX4N3M/M8a26lk+R81hDAQSxMN6pq4dj+51GUj71
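
A 200 status does not guarantee that the page contains search results. The HTML in this output is a redirect page asking the browser to enable JavaScript, so it holds no result links, which is why the next cell ends up using its fallback. A minimal heuristic check could look like this, assuming the `enablejs` path seen in the output above is a stable marker (Google may change it at any time). The name `needs_javascript` is hypothetical.

```python
def needs_javascript(html: str) -> bool:
    """Heuristic: True if the page looks like the 'enable JavaScript' redirect."""
    # Marker taken from the output above; treat it as an assumption, not an API.
    return "/httpservice/retry/enablejs" in html


blocked = '<meta content="0;url=/httpservice/retry/enablejs?sei=abc" http-equiv="refresh"/>'
print(needs_javascript(blocked))
```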

In [5]:
# More robust approach to get Wikipedia URL
import urllib.parse

# Method 1: Try to extract from Google search results
wikipedia_url = None

# Look for Wikipedia links in search results with better patterns
for link_tag in soup.find_all('a', href=True):
    href = link_tag.get('href')
    if href:
        # Check if it's a Google redirect to Wikipedia
        if '/url?q=' in href and 'en.wikipedia.org' in href:
            # Extract the actual URL
            try:
                parsed_url = urllib.parse.parse_qs(urllib.parse.urlparse(href).query)
                if 'q' in parsed_url:
                    wikipedia_url = parsed_url['q'][0]
                    break
            except ValueError:
                # Malformed redirect URL; skip this link
                pass
        # Direct Wikipedia link (less common in Google results)
        elif 'en.wikipedia.org' in href and href.startswith('http'):
            wikipedia_url = href
            break

# Method 2: If not found, construct Wikipedia URL directly
if not wikipedia_url:
    print("Trying direct Wikipedia search...")
    # Clean the search term for Wikipedia URL
    clean_name = inp.strip().replace(' ', '_')
    wikipedia_url = f"https://en.wikipedia.org/wiki/{clean_name}"
    
    # Test if this Wikipedia page exists
    test_response = requests.get(wikipedia_url, timeout=10)
    if test_response.status_code == 200 and 'Wikipedia' in test_response.text:
        print(f"Successfully found Wikipedia page at: {wikipedia_url}")
    else:
        # Try a Wikipedia search if direct page doesn't exist
        search_url = f"https://en.wikipedia.org/wiki/Special:Search?search={urllib.parse.quote(inp)}"
        print(f"Direct page not found, trying search: {search_url}")
        wikipedia_url = search_url

print(f"Final Wikipedia URL: {wikipedia_url}")

Trying direct Wikipedia search...
Successfully found Wikipedia page at: https://en.wikipedia.org/wiki/Virat_kohli
Final Wikipedia URL: https://en.wikipedia.org/wiki/Virat_kohli
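
Method 1 tests for the substring `en.wikipedia.org` anywhere in the link, which would also match a non-Wikipedia URL that merely mentions that host. A stricter variant, sketched here with the standard library only, compares the parsed hostname instead. The helper name `wikipedia_target` is hypothetical, and the `/url?q=` redirect format is the one the cell above assumes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def wikipedia_target(href: str) -> Optional[str]:
    """Return the English Wikipedia URL that href points to, or None."""
    parsed = urlparse(href)
    # Google redirect links look like /url?q=<target>&...
    if parsed.path == "/url":
        targets = parse_qs(parsed.query).get("q", [])
        candidate = targets[0] if targets else ""
    else:
        candidate = href
    # Compare the hostname exactly instead of searching the whole string.
    if urlparse(candidate).hostname == "en.wikipedia.org":
        return candidate
    return None


print(wikipedia_target("/url?q=https://en.wikipedia.org/wiki/Virat_Kohli&sa=U"))
```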


In [6]:
# scraping the paragraphs from the actual wikipedia page
if wikipedia_url:
    # Make a request to the actual Wikipedia page
    wiki_response = requests.get(wikipedia_url, timeout=10)
    wiki_soup = BeautifulSoup(wiki_response.text, 'html.parser')

    # Find the main content div in Wikipedia
    content_div = wiki_soup.find('div', {'id': 'mw-content-text'})

    # Fall back to the whole page if the main content div is not found
    container = content_div if content_div else wiki_soup

    # Join the paragraph texts, separated by blank lines
    paragraphs = '\n\n'.join(p.text.strip() for p in container.find_all('p')).strip()
    print("Wikipedia content scraped successfully!")
    print(f"Content length: {len(paragraphs)} characters")
    print("\nFirst 500 characters:")
    print(paragraphs[:500] + "..." if len(paragraphs) > 500 else paragraphs)
else:
    print("Cannot scrape Wikipedia content - no valid Wikipedia URL found")

Wikipedia content scraped successfully!
Content length: 73879 characters

First 500 characters:
Virat Kohli (born 5 November 1988) Hindi pronunciation: [ʋɪˈɾaːʈᵊ ˈkoːɦᵊliː] ⓘ is an Indian international cricketer and the former captain of the Indian national cricket team. He is a right-handed batsman and an occasional medium-fast bowler. He currently represents Royal Challengers Bengaluru in the IPL and Delhi in domestic cricket. Kohli is widely regarded as one of the greatest batters of all time.[3] He also holds the record for scoring the most centuries in ODI cricket and stands second in...
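
The extracted text still contains numeric citation markers such as `[3]`. If they are unwanted, one optional post-processing step is a regular-expression substitution. This is a sketch that is not part of the notebook: the name `strip_citations` is hypothetical, and the pattern only covers purely numeric markers.

```python
import re

# Matches numeric footnote markers such as [3] or [12].
CITATION = re.compile(r"\[\d+\]")


def strip_citations(text: str) -> str:
    """Remove numeric citation markers from scraped article text."""
    return CITATION.sub("", text)


print(strip_citations("one of the greatest batters of all time.[3] He also holds"))
```

Note that applying this before computing the statistics would change the character count reported by the notebook.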


In [None]:
# Save the scraped content to a file
if 'paragraphs' in locals() and paragraphs:
    # Create a filename based on the search term
    filename = f"{inp.replace(' ', '_')}_wikipedia.txt"
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"Wikipedia content for: {inp}\n")
        f.write(f"Source URL: {wikipedia_url}\n")
        f.write("="*50 + "\n\n")
        f.write(paragraphs)
    
    print(f"Content saved to: {filename}")
    
    # Display some statistics
    word_count = len(paragraphs.split())
    sentence_count = paragraphs.count('.') + paragraphs.count('!') + paragraphs.count('?')
    
    print("\nContent Statistics:")
    print(f"- Character count: {len(paragraphs)}")
    print(f"- Word count: {word_count}")
    print(f"- Approximate sentence count: {sentence_count}")
else:
    print("No content to save")

Content saved to: Virat_kohli_wikipedia.txt

Content Statistics:
- Character count: 73879
- Word count: 11882
- Approximate sentence count: 603
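
The filename above is built by replacing spaces only, so a name containing a character such as `/` would produce an invalid or unintended path. A more defensive variant might look like the sketch below. The helper name `safe_filename` and the `untitled` fallback are assumptions for illustration, not part of the notebook.

```python
import re


def safe_filename(name: str, suffix: str = "_wikipedia.txt") -> str:
    """Build an output filename from a person's name using only safe characters."""
    # Collapse every run of characters other than letters, digits, '_' and '-'.
    stem = re.sub(r"[^\w-]+", "_", name.strip()).strip("_")
    return (stem or "untitled") + suffix


print(safe_filename("Virat kohli"))
```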
