# Lab 8
---
Hello and welcome to Lab 8.

__Guidelines:__

- Please write and submit the two programs below by the deadline: Monday, March 27, at 6:00pm Pacific time

- You must complete the assignments individually. If you have trouble completing the assignment, please let one of the teaching assistants (TAs) know, during the lab or their office hours. They will help and guide you, but they will not write code for you, and no one else should either :)  

- You have to fill in the code in this notebook and upload it back to Blackboard for submission. Please remember to rename your file as "Lab8_[YOUR FIRSTNAME]_[YOUR LASTNAME].ipynb" (e.g. Lab8_George_Washington.ipynb).

- You may look up resources online such as the Python docs and Stack Overflow. You may look up topics, but not the questions themselves.

- You can submit only one time. Your grade will be based on this submission.

# Q1. [10 points] 
---
### API: Gender prediction

Write a function _gender_based_on_name()_ that will predict whether a given name is male or female.<br>
Use an API.<br>
See example presented in class on March 22, 2023.<br>
Call ```strip()``` over the name for added robustness.  
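
As a side note on robustness, a name that contains spaces or non-ASCII letters should be URL-encoded before it goes into a query string. The following is a minimal sketch using only the standard library; `build_genderize_url` is a hypothetical helper name, not part of the assignment. Passing `params={'name': ...}` to `requests.get()` achieves the same encoding.

```python
from urllib.parse import urlencode

def build_genderize_url(name: str) -> str:
    """Hypothetical helper: strip the name and URL-encode it into a query string."""
    query = urlencode({'name': name.strip()})
    return f'https://api.genderize.io?{query}'

print(build_genderize_url('  Mary Ann '))
```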

In [17]:
from typing import Optional
import requests

def gender_based_on_name(name: str) -> Optional[str]:
    input_name = name.strip()
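    url = 'https://api.genderize.io'
    ## let requests URL-encode the name instead of pasting it into the URL
    response = requests.get(url, params={'name': input_name})
    data = response.json()
    ## .get() returns None if the 'gender' key is missing or has no value
    return data.get('gender')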

In [18]:
# open test

print(gender_based_on_name('Mary'))         # should print 'female'
print(gender_based_on_name('John'))         # should print 'male'
print(gender_based_on_name('NoSuchName'))   # should print None

female
male
None


# Q2 [10 points]

### API: Airport weather

For this question, we will use the requests library to query ```api.weather.gov``` for an airport's weather information. Based on that, you have to determine whether the weather at the airport is cloudy.  

*How to decide whether the airport is cloudy or not?*  
We check if the ```shortForecast``` field contains the word ___cloudy___ for the next time period. Forecasts such as _mostly cloudy_ or _partly cloudy_ count as cloudy. Obtaining the value of ```shortForecast``` is a two-step process: the response to the first API request contains a URL that, when requested, returns the ```shortForecast``` that we are after.

Input: String(Airport name)  
Output: Boolean(True if cloudy, False otherwise)  
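
The decision rule above can be sketched as a small string check. This is only an illustration under one assumption: the forecast text may arrive capitalized (e.g. `Mostly Cloudy`), so the comparison is done in lowercase. `is_cloudy` is a hypothetical helper name.

```python
def is_cloudy(short_forecast: str) -> bool:
    """Return True if the forecast text mentions 'cloudy' in any letter case."""
    return 'cloudy' in short_forecast.lower()

print(is_cloudy('Mostly Cloudy'))
print(is_cloudy('Sunny'))
```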

FAQs:  
Q. What link should I use for the requests.get() function?  
A. The link would look something like this -> https://api.weather.gov/points/ ```<Latitude value>,<Longitude value>```   
An example would be: https://api.weather.gov/points/39.7456,-97.0892

Q. I got something back after the request, but I am not sure what it is.  
A. You received a response object. You can call .json() on it and see what is there. (See ```json.dumps()```)  

Q. How do I get information from this JSON?  
A. As taught in class, once you have inspected the JSON you can index it with the keys you want.  

Q. Okay, I indexed the JSON using the keys and I got a list out of it. What is that?  
A. That list is the forecast for the upcoming time periods. You'll need this information to make the decision that the question asks for. Also, each element of the list is a dict. **Make sure you understand what you are dealing with at each point.**  
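
To make the FAQ answers concrete, here is a small sketch of inspecting and indexing parsed JSON. The `sample` dict is a hand-made stand-in that only mimics the `properties` / `periods` / `shortForecast` nesting discussed above; it is not real API output.

```python
import json

# Hand-made stand-in for the parsed JSON of a forecast response; not real API output.
sample = {
    'properties': {
        'periods': [
            {'name': 'Tonight', 'shortForecast': 'Partly Cloudy'},
            {'name': 'Tuesday', 'shortForecast': 'Sunny'},
        ]
    }
}

# Pretty-print to see the nesting before deciding which keys to index.
print(json.dumps(sample, indent=2))

periods = sample['properties']['periods']   # a list ...
first_period = periods[0]                   # ... whose elements are dicts
print(type(periods).__name__, type(first_period).__name__)
print(first_period['shortForecast'])
```

Checking the type at each step is a cheap way to confirm whether the next index should be a key (dict) or a position (list).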

In [20]:
def get_airport_info() -> dict:
    """This function builds a dictionary mapping airport names to the latitude and longitude of the airports, 
    based on file Airports.txt that has been provided to you."""
    airports = dict()
    with open('Airports.txt', encoding='utf8') as file:
        for line in file:
            airport_name, coordinates = line.split('\t')[1], line.split('\t')[3].split(',')
            long, lat = float(coordinates[0][1:]), float(coordinates[1][1:-2])
            airports[airport_name] = (lat, long)
    return airports

In [21]:
import requests
import json
from typing import Optional

def cloudy_airport(airport_name: str) -> Optional[bool]:
    ## access to the dictionary
    airport_dict = get_airport_info()
    ## no info available for an unknown airport
    if airport_name not in airport_dict:
        return None
    ## get the coordinates (stored as (lat, long)) and round them as in a regular website link
    lat, long = airport_dict[airport_name]
    url_airport = f'https://api.weather.gov/points/{round(lat, 4)},{round(long, 4)}'
    ## access to the website
    response_airport = requests.get(url_airport)
    data_airport = response_airport.json()
    ## first go to the forecast link
    url_weather = data_airport['properties']['forecastHourly']
    response_weather = requests.get(url_weather)
    data_weather = response_weather.json()
    ## find the next-hour weather
    next_hour_weather = data_weather['properties']['periods'][0]
    weather_reported = next_hour_weather['shortForecast']
    ## compare in lowercase so that e.g. 'Mostly Cloudy' also matches
    return "cloudy" in weather_reported.lower()

In [22]:
# open test

airports = ["Los Angeles County Sheriff's Department Heliport"]
for airport in airports:
    cloudy = cloudy_airport(airport)
    if cloudy is None:
        print(f"No info available for {airport}.")
    elif cloudy:
        print(f"{airport} is cloudy.")
    else:
        print(f"{airport} is not cloudy.")

Los Angeles County Sheriff's Department Heliport is not cloudy.


# Q3 [10 points]

### Webscraping: Newspaper

For this question, we will scrape the website "https://www.dailynews.com".  

For a given keyword, e.g. _COVID_, return all news headlines (and their links) on the website above that contain the keyword in the headline. Specifically, return a list of tuples where the first component is the link to the article and where the second component is the headline text.

* Searches for the keyword in headlines should be case-insensitive, e.g. when searching for _marathon_, you should return headlines containing _Marathon_ (capitalized).
* Call ```strip()``` over the link and the headline text, so that they are more readable.  
* Consider using BeautifulSoup and its .prettify() feature, which might better guide your web-scraping.
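
The matching rules in the list above can be sketched independently of the scraping step. The example below assumes the (link, headline) pairs have already been extracted; `filter_headlines` is a hypothetical helper and the sample data is made up, not taken from the website.

```python
from typing import List, Tuple

def filter_headlines(items: List[Tuple[str, str]], keyword: str) -> List[Tuple[str, str]]:
    """Keep (link, headline) pairs whose headline contains keyword, ignoring case."""
    needle = keyword.casefold()
    matches = []
    for link, headline in items:
        pair = (link.strip(), headline.strip())
        # Skip pairs that were already collected, since a page may repeat a headline.
        if needle in pair[1].casefold() and pair not in matches:
            matches.append(pair)
    return matches

# Made-up sample data, not scraped from the site.
sample = [
    (' https://example.invalid/a ', ' LA Marathon route announced '),
    ('https://example.invalid/b', 'City council meets'),
    ('https://example.invalid/a', 'LA Marathon route announced'),
]
print(filter_headlines(sample, 'marathon'))
```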

In [29]:
from bs4 import BeautifulSoup
import requests
from typing import List, Tuple

def get_news_of(keyword: str) -> List[Tuple[str, str]]:
    
    ## make a list for news
    news_list = []

    url_news = "https://www.dailynews.com/"
    response_news = requests.get(url_news)
    data_news = BeautifulSoup(response_news.text, 'html.parser')
    ## exclude the trending
    news_content = data_news.find('div',{'class':'content-area'})
    ## nothing to search if the content area is missing
    if news_content is None:
        return news_list
    ## only the content area
    article = news_content.find_all('a',{'class':'article-title'})
    for news in article:
        ## .get() avoids a KeyError when an attribute is missing
        title, href = news.get('title'), news.get('href')
        if title and href:
            ## check whether the headline contains the keyword
            headline = title.strip()
            if keyword.lower() in headline.lower():
                link = href.strip()
                ## a duplicated headline may exist
                if (link, headline) not in news_list:
                    news_list.append((link, headline))
    return news_list

In [30]:
# open test
print(get_news_of('California'))
print(get_news_of('Covid'))

[('https://www.dailynews.com/2023/03/26/indigenous-tribes-work-with-swedish-and-csun-scholars-to-thrive-in-california/', 'Indigenous tribes work with Swedish and CSUN scholars to thrive in California'), ('https://www.dailynews.com/2023/03/25/what-you-need-to-know-about-growing-berries-in-southern-california/', 'What you need to know about growing berries in Southern California'), ('https://www.dailynews.com/2023/03/24/in-wake-of-southern-californias-wet-winter-potholes-pose-a-perilous-problem/', 'In wake of Southern California’s wet winter, potholes pose a perilous problem'), ('https://www.dailynews.com/2023/03/14/another-storm-hits-southern-california-with-more-rain-possible-over-the-weekend/', 'Another storm hits Southern California, with more rain possible over the weekend'), ('https://www.dailynews.com/2023/02/27/southern-californias-mountain-towns-remain-buried-under-snow-with-more-on-the-way/', 'Southern California’s mountain towns remain buried under snow with more on the way'),

# Bonus question [5 points]

### Regex: Data cleaning

Check this out:

In [25]:
s1, s2 = "Alexa", "Аlexa"
s1 == s2

False

Surprise? &nbsp; Let's investigate:

In [26]:
import unicodedata as ud

def print_unicode_names_of_letters(s: str) -> None:
    for letter in s:
        print(f"{letter}  U+{ord(letter):04X}  {ud.name(letter)}")

print_unicode_names_of_letters(s1 + s2)

A  U+0041  LATIN CAPITAL LETTER A
l  U+006C  LATIN SMALL LETTER L
e  U+0065  LATIN SMALL LETTER E
x  U+0078  LATIN SMALL LETTER X
a  U+0061  LATIN SMALL LETTER A
А  U+0410  CYRILLIC CAPITAL LETTER A
l  U+006C  LATIN SMALL LETTER L
e  U+0065  LATIN SMALL LETTER E
x  U+0078  LATIN SMALL LETTER X
a  U+0061  LATIN SMALL LETTER A


So it turns out there are look-alike characters across various scripts.

__Task:__ Write a function _clean_latin_text_ that checks and cleans a line of text as follows.
If a word inside the line contains both Latin and Cyrillic characters and all Cyrillic characters in that word have look-alike Latin characters, as defined in the _cyr2lat_ dictionary below, map all Cyrillic characters in that word to their Latin counterparts. If a word contains no Latin characters, or if some Cyrillic characters do not have any Latin counterparts, don't change any part of the word, as it is not clear that it was really meant to be a Latin-script word.

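In short:

- If a word contains both Latin and Cyrillic characters, and every Cyrillic character in it has a Latin look-alike: output the Latin version.
- If a word contains no Latin characters, or some Cyrillic character in it has no Latin look-alike: keep the original word.
- Cyrillic: code points U+0400 to U+04FF.
- Latin: code points U+0041 to U+024F.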
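
One possible way to check which scripts occur in a word, without the third-party `regex` module, is to look at Unicode character names. This is a sketch under a stated assumption: the first token of a character's Unicode name (`LATIN`, `CYRILLIC`) identifies its script. `scripts_in_word` is a hypothetical helper name.

```python
import unicodedata as ud

def scripts_in_word(word: str) -> set:
    """Return the set of script labels ('LATIN', 'CYRILLIC') found in word.

    Assumption: the first token of the Unicode character name identifies the script.
    """
    found = set()
    for ch in word:
        # ud.name() falls back to '' for characters without a name.
        first_token = ud.name(ch, '').split(' ')[0]
        if first_token in ('LATIN', 'CYRILLIC'):
            found.add(first_token)
    return found

print(scripts_in_word('Alexa'))
print(scripts_in_word('\u0410lexa'))   # starts with CYRILLIC CAPITAL LETTER A
```

A word is a candidate for cleaning only when both labels are present in the returned set.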


In [27]:
import regex

cyr2lat = {'Ѕ':'S', 'А':'A', 'В':'B', 'Е':'E', 'К':'K', 'М':'M', 'Н':'H', 'О':'O', 'Р':'P', 
           'С':'C', 'Т':'T', 'Х':'X', 'Ԛ':'Q', 'Ԝ':'W', 'а':'a', 'е':'e', 'о':'o', 'р':'p', 
           'с':'c', 'у':'y', 'х':'x', 'ѕ':'s', 'і':'i', 'ј':'j', 'ԛ':'q', 'ԝ':'w'}

def clean_latin_text(s: str) -> str:
    cleaned_text = []
    for word in s.split():
        ## collect the Cyrillic letters and check for at least one Latin letter
        cyrillic_letters = regex.findall(r'\p{Cyrillic}', word)
        has_latin = regex.search(r'\p{Latin}', word) is not None
        ## map only if the word mixes both scripts and
        ## every Cyrillic letter has a Latin look-alike
        if has_latin and cyrillic_letters and all(c in cyr2lat for c in cyrillic_letters):
            word = ''.join(cyr2lat.get(letter, letter) for letter in word)
        ## else, keep the original word
        cleaned_text.append(word)
    return ' '.join(cleaned_text)

In [28]:
# open test

print(clean_latin_text("Аlехаndеr; слава і воля."))
print_unicode_names_of_letters(clean_latin_text("Аlехаndеr; слава і воля."))
# Should yield: Alexander; слава і воля.  (Use print_unicode_names_of_letters() as needed.)

Alexander; слава і воля.
A  U+0041  LATIN CAPITAL LETTER A
l  U+006C  LATIN SMALL LETTER L
e  U+0065  LATIN SMALL LETTER E
x  U+0078  LATIN SMALL LETTER X
a  U+0061  LATIN SMALL LETTER A
n  U+006E  LATIN SMALL LETTER N
d  U+0064  LATIN SMALL LETTER D
e  U+0065  LATIN SMALL LETTER E
r  U+0072  LATIN SMALL LETTER R
;  U+003B  SEMICOLON
   U+0020  SPACE
с  U+0441  CYRILLIC SMALL LETTER ES
л  U+043B  CYRILLIC SMALL LETTER EL
а  U+0430  CYRILLIC SMALL LETTER A
в  U+0432  CYRILLIC SMALL LETTER VE
а  U+0430  CYRILLIC SMALL LETTER A
   U+0020  SPACE
і  U+0456  CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
   U+0020  SPACE
в  U+0432  CYRILLIC SMALL LETTER VE
о  U+043E  CYRILLIC SMALL LETTER O
л  U+043B  CYRILLIC SMALL LETTER EL
я  U+044F  CYRILLIC SMALL LETTER YA
.  U+002E  FULL STOP
