<a href="https://colab.research.google.com/github/RanjanaRaghavan/PyVentures/blob/main/Day2/htmlParser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping & Data Handling 📊🌐

This notebook combines web scraping with basic data handling. The first cell uses requests and BeautifulSoup to download a Wikipedia article and read its title from the h1 element whose id is firstHeading; here it fetches the title of the article on the film Interstellar. The second cell loads a CSV file with pandas and converts the median_income column to numeric values. Entries that cannot be parsed become NaN and are then replaced with -1. Finally, it sorts the DataFrame by median_income in ascending order, which is a convenient starting point for tasks such as inspecting the income distribution.

In [10]:
import bs4, requests

def getWikipediaTitle(articleUrl):
    res = requests.get(articleUrl)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # Find the title element using its id
    titleElem = soup.find('h1', id='firstHeading')

    if titleElem:
        return titleElem.text.strip()
    else:
        return "Title not found"

title = getWikipediaTitle('https://en.wikipedia.org/wiki/Interstellar_(film)')
print('The title is: ' + title)

The title is: Interstellar (film)
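The lookup logic can also be checked without a network request. The following is a minimal sketch, assuming two hand-written HTML snippets in place of a downloaded page; the helper name extract_title and both snippets are hypothetical and not part of the original notebook.

```python
import bs4

def extract_title(html):
    """Return the text of the h1 with id 'firstHeading', or a fallback message."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    title_elem = soup.find("h1", id="firstHeading")
    if title_elem is not None:
        return title_elem.text.strip()
    return "Title not found"

# Hypothetical snippets standing in for a downloaded page
with_heading = '<html><body><h1 id="firstHeading"> Interstellar (film) </h1></body></html>'
without_heading = "<html><body><h1>Some other heading</h1></body></html>"

print(extract_title(with_heading))     # heading present: stripped title text
print(extract_title(without_heading))  # no matching id: fallback message
```

Feeding the parser a fixed string exercises both branches (title found, title missing) and keeps the check independent of Wikipedia's availability or markup changes.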

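Before running the next cell on the full CSV, the coerce-and-fill step can be tried on a tiny in-memory DataFrame. This is only an illustrative sketch: the city column and every value in it are made up, and the "n/a" entry stands in for a cell that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical rows; "n/a" stands in for an unparseable cell
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "median_income": ["3.5", "n/a", "1.2", "8.0"],
})

# errors="coerce" turns unparseable entries into NaN; fillna then replaces NaN with -1
df["median_income"] = pd.to_numeric(df["median_income"], errors="coerce").fillna(-1)

# Ascending sort: the -1 sentinel row comes first, ahead of every real value
df_sorted = df.sort_values(by="median_income", ascending=True)
print(df_sorted)
```

Because -1 is lower than any income in this sample, the rows that failed to parse collect at the top of an ascending sort, which makes them easy to spot. The same sentinel would mix in with real values if the column could legitimately be negative.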

In [11]:
import pandas as pd

# Read CSV data into a DataFrame
df = pd.read_csv('/content/sample_data/california_housing_test.csv')

# Convert 'median_income' to numeric; values that fail to parse become NaN and are replaced with -1
df['median_income'] = pd.to_numeric(df['median_income'], errors='coerce').fillna(-1)


# Sort the DataFrame by the 'median_income' column in ascending order
df_sorted = df.sort_values(by='median_income', ascending=True)  # Set ascending=False for descending order

# Display the sorted DataFrame
print("\nSorted DataFrame:")
print(df_sorted)



Sorted DataFrame:
      longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
185     -118.28     34.02                29.0        515.0           229.0   
2640    -114.62     33.62                26.0         18.0             3.0   
2879    -121.84     38.02                46.0         66.0            22.0   
641     -121.04     37.67                16.0         19.0            19.0   
2841    -118.27     33.96                38.0       1126.0           270.0   
...         ...       ...                 ...          ...             ...   
2583    -118.41     34.09                37.0       2716.0           302.0   
42      -118.06     34.15                37.0       1980.0           226.0   
2199    -118.20     34.19                38.0       2176.0           266.0   
1383    -118.37     34.10                37.0        407.0            67.0   
161     -117.85     33.62                13.0       5192.0           658.0   

      population  households  median_income 