# Exercises: Day 20
__1. Read this URL and find the 10 most frequent words: romeo_and_juliet = 'http://www.gutenberg.org/files/1112/1112.txt'__

In [5]:
import requests
from collections import Counter
import re

# Fetching the text from the URL
url = 'https://www.gutenberg.org/files/1112/1112-0.txt'
response = requests.get(url)
if response.status_code == 200:
    # Extracting text content
    text = response.text
    
    # Removing punctuation and converting text to lowercase
    text = re.sub(r'[^\w\s]', '', text).lower()
    
    # Splitting the text into words
    words = text.split()

    # Counting the frequency of each word
    word_freq = Counter(words)
    
    # Getting the 10 most common words
    most_common_words = word_freq.most_common(10)
    
    print("The 10 most frequent words are:")
    for word, frequency in most_common_words:
        print(f"{word}: {frequency}")
else:
    print("Failed to fetch the content from the URL")
    print('status_code is: ', response.status_code)



The 10 most frequent words are:
the: 844
and: 761
to: 630
i: 597
a: 528
of: 503
in: 376
my: 374
you: 363
is: 362
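
The cleaning and counting steps in the solution above can be tried offline on a short string. This is only a minimal sketch: the helper name `top_words` and the sample sentence are made up for illustration and are not part of the exercise.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most common lowercase words in text (hypothetical helper)."""
    # Drop every character that is neither a word character nor whitespace,
    # then lowercase so that 'The' and 'the' are counted together
    cleaned = re.sub(r'[^\w\s]', '', text).lower()
    # split() with no arguments splits on any run of whitespace
    return Counter(cleaned.split()).most_common(n)

# Made-up sample text, used only to show the behaviour
sample = "The cat saw the dog. The dog saw a bird!"
print(top_words(sample, 3))
```

Because the pattern removes apostrophes as well, a contraction such as "o'er" is counted as "oer", which may or may not be what you want for a play.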


__2. Read the cats API (cats_api = 'https://api.thecatapi.com/v1/breeds') and find:__
   1. the min, max, mean, median, standard deviation of cats' weight in metric units.
   2. the min, max, mean, median, standard deviation of cats' lifespan in years.
   3. Create a frequency table of country and breed of cats

In [8]:
import requests
import statistics
from collections import Counter

cats_api = 'https://api.thecatapi.com/v1/breeds'

# Fetching data from the API
response = requests.get(cats_api)

if response.status_code == 200:
    cat_data = response.json()
    
    # Lists to store weights and lifespans
    weights = []
    lifespans = []
    
    # Dictionary for frequency table of country and breed
    country_breed_freq = Counter()
    
    for cat in cat_data:
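        # Handling weight data: only the first number in the string is kept,
        # so for a range such as '3 - 5' the statistics describe the lower bound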
        if 'weight' in cat and 'metric' in cat['weight']:
            weight_str = cat['weight']['metric']
            weight = float(weight_str.split()[0]) if weight_str else None
            if weight:
                weights.append(weight)
        
        # Handling lifespan data: as with weight, only the first number in the string is kept
        if 'life_span' in cat:
            lifespan_str = cat['life_span']
            lifespan = float(lifespan_str.split()[0]) if lifespan_str else None
            if lifespan:
                lifespans.append(lifespan)
        
        # Some entries might not have 'country_code' or 'name' field, so check if they exist
        if 'country_code' in cat and 'name' in cat:
            country_code = cat['country_code']
            breed_name = cat['name']
            
            # Incrementing the count for the (country, breed) pair
            country_breed_freq[(country_code, breed_name)] += 1
    
    # Calculations for weights
    min_weight = min(weights) if weights else None
    max_weight = max(weights) if weights else None
    mean_weight = statistics.mean(weights) if weights else None
    median_weight = statistics.median(weights) if weights else None
    stdev_weight = statistics.stdev(weights) if len(weights) > 1 else None
    
    # Calculations for lifespans
    min_lifespan = min(lifespans) if lifespans else None
    max_lifespan = max(lifespans) if lifespans else None
    mean_lifespan = statistics.mean(lifespans) if lifespans else None
    median_lifespan = statistics.median(lifespans) if lifespans else None
    stdev_lifespan = statistics.stdev(lifespans) if len(lifespans) > 1 else None
    
    # Displaying results
    print("Statistics for cat weights (in metric units):")
    print(f"Min: {min_weight}")
    print(f"Max: {max_weight}")
    print(f"Mean: {mean_weight}")
    print(f"Median: {median_weight}")
    print(f"Standard Deviation: {stdev_weight}")
    
    print("\nStatistics for cat lifespans (in years):")
    print(f"Min: {min_lifespan}")
    print(f"Max: {max_lifespan}")
    print(f"Mean: {mean_lifespan}")
    print(f"Median: {median_lifespan}")
    print(f"Standard Deviation: {stdev_lifespan}")
    
    print("\nFrequency table of country and breed of cats:")
    for (country, breed), count in country_breed_freq.items():
        print(f"{country} - {breed}: {count}")
else:
    print("Failed to fetch data from the API")
    print('status_code is: ', response.status_code)


Statistics for cat weights (in metric units):
Min: 2.0
Max: 5.0
Mean: 3.2238805970149254
Median: 3.0
Standard Deviation: 0.8845628182703051

Statistics for cat lifespans (in years):
Min: 8.0
Max: 18.0
Mean: 12.074626865671641
Median: 12.0
Standard Deviation: 1.8283411328456127

Frequency table of country and breed of cats:
EG - Abyssinian: 1
GR - Aegean: 1
US - American Bobtail: 1
US - American Curl: 1
US - American Shorthair: 1
US - American Wirehair: 1
AE - Arabian Mau: 1
AU - Australian Mist: 1
US - Balinese: 1
US - Bambino: 1
US - Bengal: 1
FR - Birman: 1
US - Bombay: 1
GB - British Longhair: 1
GB - British Shorthair: 1
MM - Burmese: 1
GB - Burmilla: 1
US - California Spangled: 1
US - Chantilly-Tiffany: 1
FR - Chartreux: 1
EG - Chausie: 1
US - Cheetoh: 1
US - Colorpoint Shorthair: 1
GB - Cornish Rex: 1
CA - Cymric: 1
CY - Cyprus: 1
GB - Devon Rex: 1
RU - Donskoy: 1
CN - Dragon Li: 1
EG - Egyptian Mau: 1
MM - European Burmese: 1
US - Exotic Shorthair: 1
GB - Havana Brown: 1
US - Him
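
The solution above keeps only the first number of each weight and lifespan string. Assuming the API returns ranges such as '3 - 5' (an assumption about the data shape, suggested by the `split()[0]` parsing), one alternative is to read both bounds and work with the midpoint. The helper `parse_range` and the sample values below are hypothetical and only sketch the idea.

```python
import re
import statistics

def parse_range(range_str):
    """Return (low, high) as floats from a string like '3 - 5', or None if it has no number."""
    # Collect every integer or decimal number that appears in the string
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', range_str or '')]
    if not numbers:
        return None
    # A single number such as '12' yields (12.0, 12.0)
    return min(numbers), max(numbers)

# Hypothetical sample values in the assumed shape of the API fields
sample_weights = ['3 - 5', '4 - 7', '2 - 4']

midpoints = []
for weight_str in sample_weights:
    bounds = parse_range(weight_str)
    if bounds:
        midpoints.append(sum(bounds) / 2)

print(midpoints)
print(statistics.mean(midpoints))
```

Whether the lower bound, the upper bound, or the midpoint is the right summary depends on how the exercise is read, so the statistics printed above remain a valid answer under the lower-bound interpretation.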

**3. Read the [countries API](https://restcountries.eu/rest/v2/all) and find**
   1. the 10 largest countries
   2. the 10 most spoken languages
   3. the total number of languages in the countries API

In [10]:
import requests

# API endpoint for countries
countries_api = 'https://restcountries.com/v2/all'

# Fetching data from the API
response = requests.get(countries_api)

if response.status_code == 200:
    countries_data = response.json()
    
    # Finding the 10 largest countries by area
    largest_countries = sorted(countries_data, key=lambda x: x.get('area', 0), reverse=True)[:10]
    
    # Finding the 10 most spoken languages
    all_languages = []
    for country in countries_data:
        all_languages.extend(country.get('languages', []))
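    # Counting by language name in a single pass (calling list.count() for every item is quadratic)
    spoken_languages_count = {}
    for language in all_languages:
        spoken_languages_count[language['name']] = spoken_languages_count.get(language['name'], 0) + 1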
    top_spoken_languages = sorted(spoken_languages_count.items(), key=lambda x: x[1], reverse=True)[:10]
    
    # Finding the total number of unique languages in the API
    unique_languages = set()
    for country in countries_data:
        unique_languages.update([lang['name'] for lang in country.get('languages', [])])
    total_languages = len(unique_languages)
    
    # Displaying the results
    print("The 10 largest countries by area:")
    for country in largest_countries:
        print(f"{country['name']}: {country.get('area')} square kilometers")
    
    print("\nThe 10 most spoken languages:")
    for language, count in top_spoken_languages:
        print(f"{language}: {count} countries")
    
    print(f"\nThe total number of languages in the countries API: {total_languages}")
else:
    print("Failed to fetch data from the Countries API")
    print("Status code is:", response.status_code)


The 10 largest countries by area:
Russian Federation: 17124442.0 square kilometers
Antarctica: 14000000.0 square kilometers
Canada: 9984670.0 square kilometers
China: 9640011.0 square kilometers
United States of America: 9629091.0 square kilometers
Brazil: 8515767.0 square kilometers
Australia: 7692024.0 square kilometers
India: 3287590.0 square kilometers
Argentina: 2780400.0 square kilometers
Kazakhstan: 2724900.0 square kilometers

The 10 most spoken languages:
English: 91 countries
French: 44 countries
Arabic: 25 countries
Spanish: 24 countries
Portuguese: 10 countries
Russian: 8 countries
Dutch: 8 countries
German: 7 countries
Chinese: 5 countries
Serbian: 4 countries

The total number of languages in the countries API: 123
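
The language tally can also be written with `collections.Counter`, which yields both the ranking and the number of distinct languages from one object. The sketch below runs on a small made-up list in the same shape as the API response (a list of country dicts, each with an optional `languages` list of dicts that have a `name` key); the sample data is hypothetical.

```python
from collections import Counter

# Hypothetical sample mimicking the structure of the countries API response
countries = [
    {'name': 'A', 'languages': [{'name': 'English'}, {'name': 'French'}]},
    {'name': 'B', 'languages': [{'name': 'English'}]},
    {'name': 'C'},  # a country without a 'languages' key
]

# Count how many countries list each language name
language_counts = Counter(
    lang['name']
    for country in countries
    for lang in country.get('languages', [])
)

print(language_counts.most_common(2))  # languages ranked by number of countries
print(len(language_counts))            # number of distinct language names
```

Note that this ranks languages by the number of countries that list them, not by number of speakers, which matches what the solution above prints.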


__4. UCI is one of the most common places to get data sets for data science and machine learning. Read the content of UCI (https://archive.ics.uci.edu/ml/datasets.php). Without additional libraries it will be difficult, so you may try it with BeautifulSoup4__

In [11]:
import requests
from bs4 import BeautifulSoup

# URL of UCI Machine Learning Repository
uci_url = 'https://archive.ics.uci.edu/datasets?skip=0&take=10&sort=desc&orderBy=NumHits&search=ml'

# Fetching the content of the page
response = requests.get(uci_url)

if response.status_code == 200:
    # Parsing the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Finding and displaying the text content
    page_content = soup.get_text()
    print(page_content)
else:
    print("Failed to fetch the content from the UCI website")
    print("Status code is:", response.status_code)

UCI Machine Learning Repository

Datasets - UCI Machine Learning Repository




       Datasets Contribute Dataset Donate New Link External About Us Who We Are Citation Metadata Contact Information           Login    Filters            Keywords     Data Type      Subject Area      Task      # Features      # Instances      Feature Type      Python    Browse Datasets   Filters  Sort by # Views, desc # Views   Name  # Instances  # Features  Date Donated  Relevance        Expand All Collapse All    Internet Advertisements This dataset represents a set of possible advertisements on Internet pages.  Classification  Multivariate  3.28K Instances  1.56K Features      Heart failure clinical records This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features.  Classification, Regression, Clustering  Multivariate  299 Instances  12 Features      Iranian Churn Dataset This 