## Social Computing: Notebook 3

Please include your names below and rename the file so that it contains the last names of everyone answering.

In [1]:
# Students: ..., ...

# Part 1: Feature extraction (30 points total)

For this exercise you are provided with a dataset of hashed email addresses. This dataset can be found in hashedEmailAddressesGitHubCommits.csv, at https://www.dropbox.com/s/kevcp917ok9qyn0/hashedEmailAddressesGitHubCommits.csv.zip?dl=0. Please make sure you can access this file.

## Identifying the country of residence

**Exercise 1.1 (12 points)**

Based on the email address, we can infer different properties about the user. One example is their country of residence. For example, an email like “1315ae6229444367968a943a219f38def9a8112d@vpn-251-169.epfl.ch” will probably correspond to a user located in Switzerland, while an email of the form “59bd0a3ff43b32849b319e645d4798d8a5d1e889@philipphauer.de” will probably correspond to a user located in Germany. 

Write a script that classifies the email addresses from the file above, based on the country of residence.
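
As a minimal sketch of the idea, the country-code top-level domain is the last dot-separated label of the part after the `@`. The three-entry `CCTLD_TO_COUNTRY` table and the helper name `country_from_email` are illustrative assumptions, not part of the assignment; a real solution needs the full list of country codes.

```python
# Hypothetical, deliberately incomplete lookup table (assumption for illustration).
CCTLD_TO_COUNTRY = {"ch": "Switzerland", "de": "Germany", "jp": "Japan"}


def country_from_email(email):
    """Return the country implied by the address's top-level domain, or None."""
    local, sep, domain = email.rpartition("@")
    if not sep or not domain:
        return None  # no "@" or empty domain: nothing to classify
    # Last label of the domain, ignoring case and a trailing dot.
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


# The two example addresses from the exercise text.
print(country_from_email("1315ae6229444367968a943a219f38def9a8112d@vpn-251-169.epfl.ch"))
print(country_from_email("59bd0a3ff43b32849b319e645d4798d8a5d1e889@philipphauer.de"))
```

Generic domains such as `.com` are not in the table, so the function returns `None` for them; they carry no country information on their own.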

In [16]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

data = pd.read_csv('hashedEmailAddressesGitHubCommits.csv', header=None)
data.columns = ['email']

# scrape the table of top-level domains and the countries they belong to
url = "https://www.worldstandards.eu/other/tlds/"
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
country = [x.text.strip() for x in soup.find_all('td', class_='column-1')]
tld = [x.text.strip() for x in soup.find_all('td', class_='column-2')]

# dictionary lookup instead of list search; the first occurrence of a TLD wins
tld_to_country = {}
for t, c in zip(tld, country):
    tld_to_country.setdefault(t, c)


def get_country_from_email(df):
    """Label every address with the country of its top-level domain."""
    rows = []
    for email in df['email']:
        try:
            # the top-level domain is the last dot-separated part after the "@"
            domain = email.split("@")[1]
            key = '.' + domain.split(".")[-1].lower()
            rows.append([email, key, tld_to_country.get(key, 'unknown')])
        except (AttributeError, IndexError):
            # missing value (not a string) or address without "@"
            rows.append([email, '', ''])
    return pd.DataFrame(rows, columns=['email', 'identifier', 'country'])

results = get_country_from_email(data)
results

| | email | identifier | country |
|---|---|---|---|
| 0 | 9fd8de5fc2a7c2c0d469b2fff1afde4e5def37ba@georg... | .com | unknown |
| 1 | 229de65f28dfbfbe8eeed5e512ab772e573e891a@gmail... | .com | unknown |
| 2 | 869a92ca579ecfd79c8d4f27a1fc29f47d2b6b84@kitco... | .com | unknown |
| 3 | f0ed5cc04491838daa1cd1ee171dabd3e0309500@cpc.a... | .jp | Japan |
| 4 | 1a66a945a7c848e2e3f1cd5f9692be1f16a0a484@users... | .com | unknown |
| ... | ... | ... | ... |
| 2360775 | cbe7447ad97f27864fa4162e1cf8a37f986c21f6@gmail... | .com | unknown |
| 2360776 | a3fd0d91f946bcb8d4c687d19402aed11176cb1b@gmail... | .com | unknown |
| 2360777 | f65118d57734745c2604933634602b8ae81d256f@gmail... | .com | unknown |
| 2360778 | a6daaf7a71ffb094243884e0d9e615984dde4618@gmail... | .com | unknown |


**Exercise 1.2 (6 points)**

Select 100 email addresses from the list. Each member of the team should manually evaluate them (separately).

Next, compare your results. Did you agree on all the 100 entries? Were you able to label all email addresses?

In [None]:
# A list of 100 emails classified manually by both students with appropriate analysis for full points
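
One way to quantify how far the two annotators agree is the raw share of matching labels together with Cohen's kappa, which corrects for the agreement expected by chance. The sketch below assumes each annotator's labels are stored as a list of country names, with `'unknown'` for addresses that could not be labelled; the function names and the five toy labels are made up for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    p_o = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical, not taken from the dataset).
student_1 = ["Japan", "unknown", "Germany", "unknown", "Switzerland"]
student_2 = ["Japan", "unknown", "Germany", "USA", "Switzerland"]
print(percent_agreement(student_1, student_2))
print(cohens_kappa(student_1, student_2))
```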

**Exercise 1.3 (9 points)**

Now, evaluate the coverage and the accuracy of your algorithm by comparing your labels with the algorithm's result. 
 - Based on the random sample, what % of the data can your algorithm classify? Can you add something to your code to increase it? If yes, include your updated code below. How much did coverage increase?
 - Based on the classified items on your list, what % did your algorithm classify correctly?

In [None]:
# Correct answer plus improved code. For example, classifying based on educational institution in addition to the country-level domain is good enough. If the code for 1.1 was already perfect, the points are awarded anyway.
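
Coverage and accuracy can be computed along the lines of the sketch below. It assumes that the algorithm marks addresses it cannot classify as `'unknown'` and that accuracy is measured only over the addresses the algorithm did classify; the function name and the toy labels are illustrative.

```python
def coverage_and_accuracy(predicted, manual, unknown="unknown"):
    """Return (coverage, accuracy) of predicted labels against manual labels.

    coverage: share of all items the algorithm assigned a country to
    accuracy: share of those classified items that match the manual label
    """
    if len(predicted) != len(manual):
        raise ValueError("label lists must have the same length")
    classified = [(p, m) for p, m in zip(predicted, manual) if p != unknown]
    coverage = len(classified) / len(predicted) if predicted else 0.0
    correct = sum(p == m for p, m in classified)
    accuracy = correct / len(classified) if classified else 0.0
    return coverage, accuracy


# Toy sample (hypothetical labels, not results from the dataset).
algorithm = ["Japan", "unknown", "Germany", "unknown"]
manual = ["Japan", "USA", "Austria", "unknown"]
coverage, accuracy = coverage_and_accuracy(algorithm, manual)
print("coverage: {:.0%}, accuracy: {:.0%}".format(coverage, accuracy))
```

Running the same function before and after improving the classifier gives the increase in coverage directly.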

**Exercise 1.4 (3 points)**

The country of residence is not the only characteristic of the user that can be inferred from their email address. Make a short list of other characteristics that you could infer from the data.

Example responses are:
* organization type (commercial / non-profit)
* organization name
* noreply or not
* university
* e-mail provider company
* device (if the domain ends in .local)
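
Several of these characteristics can be read directly off the domain. The sketch below is a rough heuristic under stated assumptions: the provider list, the `.edu` / `.ac.<cc>` rule for universities, and the function name `email_features` are illustrative choices, not an exhaustive classification.

```python
# Assumed, incomplete list of free e-mail providers (for illustration only).
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def email_features(email):
    """Infer a few coarse properties from the part after the '@'."""
    _, _, domain = email.lower().rpartition("@")
    labels = domain.split(".")
    return {
        "provider": domain if domain in FREE_PROVIDERS else None,
        "noreply": "noreply" in domain,
        # .edu and second-level .ac.<cc> domains usually belong to universities
        "academic": labels[-1] == "edu" or (len(labels) >= 3 and labels[-2] == "ac"),
        # .local host names usually point to a personal machine
        "local_device": labels[-1] == "local",
    }


# Hypothetical address, not a row from the dataset.
print(email_features("abc123@users.noreply.github.com"))
```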

# Part 2: Rate limiting (15 points total)

**Exercise 2.1 (2 points)**

By now, you are familiar with a few APIs, namely Google Books and NYT. For both of them, find the rules about rate limits and summarise them below.

Google Books: 
Per-site/Per-user quota: 50 Q/sec, 1200 Q/min
Per project quota: 100'000'000 Q/day
    
NYT: 
4'000 requests per day
10 requests per minute
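
To stay under a per-minute limit such as the NYT one, a client can throttle itself. The sketch below is one possible sliding-window limiter, not something either API provides; the class name `RateLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable only so that the behaviour can be checked without actually waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of the calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


# 10 calls per 60 seconds, matching the NYT per-minute limit listed above.
limiter = RateLimiter(max_calls=10, period=60)
```

Calling `limiter.wait()` immediately before each `requests.get(...)` keeps the request rate within the window.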

**Exercise 2.2 (8 points)**

Next, pick one of these two APIs and try to exceed the rate limit. What reaction do you get from the API? Comment your code to make your logic clear!

In [17]:
## code to exceed rate limit
import requests

# we send 15 requests to the NYT API in quick succession (the limit is 10 per minute)
ny = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=corona&api-key=31lGHgX5YdO9IN07sVxhdPMapiZXgxT7"
for i in range(15):
    response = requests.get(ny)
    print(response)
    print('Request number ', i)
# from the 11th request within one minute onwards, we get response status 429 (Too Many Requests) instead of 200 (OK)

<Response [200]>
Request number  0
<Response [200]>
Request number  1
<Response [200]>
Request number  2
<Response [200]>
Request number  3
<Response [200]>
Request number  4
<Response [200]>
Request number  5
<Response [200]>
Request number  6
<Response [200]>
Request number  7
<Response [200]>
Request number  8
<Response [200]>
Request number  9
<Response [429]>
Request number  10
<Response [429]>
Request number  11
<Response [429]>
Request number  12
<Response [429]>
Request number  13
<Response [429]>
Request number  14
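
Once the API answers with 429, a polite client should pause before retrying. The sketch below retries with exponential backoff and honours a numeric `Retry-After` header if the server sends one; whether the NYT API sends that header is not visible in the output above, so treat it as an assumption. `get_with_backoff` and its parameters are hypothetical names, and `fetch` stands for any function that returns an object with `status_code` and `headers`, such as `requests.get`.

```python
import time


def get_with_backoff(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on HTTP 429 with exponentially growing pauses."""
    for attempt in range(max_retries + 1):
        response = fetch(url)
        # Return on any non-429 answer, or give up after the last retry.
        if response.status_code != 429 or attempt == max_retries:
            return response
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)  # server-suggested pause in seconds
        else:
            delay = base_delay * 2 ** attempt  # 1 s, 2 s, 4 s, ...
        sleep(delay)
```

With the variables from the cell above, a call would look like `response = get_with_backoff(requests.get, ny)`.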


**Exercise 2.3 (5 points)**

In the next problem you will check how many requests you can send to Google Search before getting blocked. Websites protect themselves from automated crawling by tracking requests that come from the same computer within a short time frame; after a while, they stop answering those requests normally. A valid response would be "Response 200", which you can see if you just print the response of `requests.get('https://www.google.com/search?q=zurich')`.

Write code to find out how many requests it takes to get blocked (that is, until you first get a response other than 200). In addition, state the status code of the blocked response and what it stands for (Google response XXX).

In [19]:
import requests

number_of_calls = 0

_SESSION = requests.Session()

# keep requesting until the first response that is not 200,
# and keep that response instead of sending one more request
while True:
    response = _SESSION.get('http://www.google.com/search?q=zurich')
    if response.status_code != 200:
        break
    number_of_calls += 1

print(response)
print(number_of_calls)


<Response [429]>
0


Answer: Google blocked us after 68 requests, with Response 429 (Too Many Requests). The output above shows 0, presumably because the cell was run again while we were still blocked.

# Part 3: Selenium (55 points total)

## Before you start: Selenium Download
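For the next exercises you will need Selenium and ChromeDriver, the driver that Selenium uses to control Chrome.

You can read more about the webdriver here (https://chromedriver.chromium.org), but if you want to go straight to the download, go to https://chromedriver.storage.googleapis.com/index.html?path=89.0.4389.23/ and download the build for your operating system. The ChromeDriver version has to match the version of Chrome installed on your machine.

Then install the Python package by typing `pip install selenium` in your terminal.

Once this is done, you should be able to run:
- `from selenium import webdriver`
- `browser = webdriver.Chrome([the path where you put chromedriver])`

In recent Selenium 4 releases the path has to be wrapped in a `Service` object instead: import `Service` from `selenium.webdriver.chrome.service` and call `webdriver.Chrome(service=Service([the path where you put chromedriver]))`.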

In case of any issues, the https://chromedriver.chromium.org website has some straightforward info on common bugs. 


## Selenium Sessions

**Exercise 3.1 (25 points)**

Go to a website of your choice where you have an account. It can, for example, be the New York Times API website where you created a login last time, but also tutti.ch, comparis, or any other simple website you use often.

Using Selenium, create a session where you
1. go to the main website 
2. log in 
3. click on an element of your choice 
4. scroll to the bottom of the page
5. then save the page. 

When logging in, you will have to find the names of the login form's fields, enter your credentials into them, and then click the login button. An example of a login using Selenium is available at https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/; if you decide to use it for help, Facebook should not be your chosen website.

Tip: Does the website use a captcha? You can put your script to sleep for a number of seconds with the `time.sleep()` function and solve the captcha manually in the meantime.
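
On pages that load more content as you scroll, a single jump to the bottom may not reach the real end of the page. The sketch below repeats step 4 until the page height stops growing. It relies only on the driver's `execute_script` method; the function name `scroll_to_bottom`, the pause length, and the round limit are assumptions to adapt to the chosen website.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20, sleep=time.sleep):
    """Scroll down repeatedly until the page height no longer increases.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded: we are at the bottom
        last_height = new_height
    return max_rounds
```

With a Selenium session stored in `browser`, calling `scroll_to_bottom(browser)` before saving the page covers both static and lazily loaded pages.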

In [23]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Selenium 4 style; older releases accept the driver path as the first argument
browser = webdriver.Chrome(service=Service(".../chromedriver"))

browser.get("https://gab.com/auth/sign_in")
username = "..."
password = ".."

#wait for the login form to load
time.sleep(3)

#log in
browser.find_element(By.ID, "user_email").send_keys(username)
browser.find_element(By.ID, "user_password").send_keys(password)

#click the login button
browser.find_element(By.CLASS_NAME, "actions").click()

#this allows the site to load
time.sleep(5)

#scrolls down
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#saves page
with open('gab_page.html', 'w', encoding='utf-8') as f:
    f.write(browser.page_source)

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 99
Current browser version is 101.0.4951.64 with binary path /Applications/Google Chrome.app/Contents/MacOS/Google Chrome

## Measuring personalization

**Exercise 3.2 (30 points)**

In this exercise you will have to imitate the study described in class on a website of your interest. You will have to measure differences in the content that you receive back from the website under varying treatments. 

You will have to choose a website and a treatment. Use Selenium for this exercise as well.
- As for websites, you can pick an online store, a travel site, a news site, Google News, and so on; basically, try to pick something that you suspect gives different results to different searchers.
- Examples of treatments would be location, being logged in with an account, history with the website, being on a phone vs. a desktop, etc.
- You can run multiple searches to make sure you are measuring a real phenomenon and not only noise.
- You can include a control treatment in case you suspect there is A/B testing or noise in how the pages look.
- Finally, you have to pick a measure for the differences on the page. In case you receive items on a page, for example URLs or products, you can define an overlap metric. In case the page is more unstructured, come up with an explanation for how you define differences.

As your answer, explain which of the above you chose, how you implemented the experiment, and what difference you found in the pages you collected. 
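
For pages that return a list of items, one common overlap metric is the Jaccard index: the number of items both pages share divided by the number of distinct items on either page. The sketch below is one possible definition, not the one prescribed by the study discussed in class; the function name and the toy result lists are hypothetical.

```python
def jaccard_overlap(items_a, items_b):
    """Jaccard index of two result lists: |A & B| / |A | B| (1.0 if both are empty)."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


# Toy result lists (made up for illustration).
control = ["url-1", "url-2", "url-3", "url-4"]
treatment = ["url-1", "url-2", "url-5", "url-6"]
print(jaccard_overlap(control, treatment))
```

Computing the same metric between two runs of the control treatment gives a noise baseline against which the treatment difference can be judged.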

You can find more info on how to run multiple browsers at the same time here: https://crossbrowsertesting.com/blog/selenium/run-test-multiple-browsers-parallel-selenium/

In [None]:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("user-data-dir=/...")

# Selenium 4 style: driver path via Service, options via the `options` keyword
chrome = webdriver.Chrome(service=Service(".../chromedriver"), options=chrome_options)


def get_video_titles(browser):
    """Load the YouTube home page and return the video titles shown on it."""
    browser.get("https://www.youtube.com/")
    time.sleep(3)  # give the page time to render
    html = BeautifulSoup(browser.page_source, "html.parser")
    tags = html.find_all("yt-formatted-string", id="video-title")
    return [tag.text for tag in tags]


# titles shown before logging in
chrome_titles = get_video_titles(chrome)

# time to log in to youtube manually
time.sleep(30)

# titles shown after logging in
chrome_titles_2 = get_video_titles(chrome)

print(len(chrome_titles))
print(len(chrome_titles_2))

# share of distinct titles that appear on both pages (Jaccard index);
# dividing the matches by the summed list lengths could never exceed 50%
titles_before, titles_after = set(chrome_titles), set(chrome_titles_2)
union = titles_before | titles_after
matching_quota = len(titles_before & titles_after) / len(union) * 100 if union else 0.0

print("The matching quota is: {}".format(matching_quota))