# Workshop - in-class activity

## Q1 Web scraping

For this question, follow the instructions below carefully:

+ Make an HTTP GET request to the Colorado.edu News website (https://www.colorado.edu/today/news-headlines).
+ Use Beautiful Soup to parse the HTML content.
+ Locate and extract the article titles (tip: look for **div** tags with the class **"article-view-mode-sidebar-content"**).
+ Store them in a list that includes the title, link, and plain HTML content of each news item.
+ Finally, create a DataFrame from the list, display it, and save it as a CSV file.

### Step 1. Import libraries

In [26]:
!pip install requests==2.32.3
!pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen



### Step 2. Send an HTTP GET request to the website - https://www.colorado.edu/today/news-headlines

In [28]:
url = "https://www.colorado.edu/today/news-headlines"
urlclient = urlopen(url)
webpage = urlclient.read()
# Your Code Here
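
Step 1 imports `requests`, but the cell above fetches the page with `urlopen`. A minimal sketch of the same GET made with `requests` might look like the following. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the workshop.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML body of `url`, raising on HTTP error statuses."""
    # `timeout` (in seconds) is an assumed value; adjust it as needed.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Usage (requires network access):
# webpage = fetch_html("https://www.colorado.edu/today/news-headlines")
```

The returned string can be passed to Beautiful Soup in the same way as the bytes read from `urlopen`.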

### Step 3. Parse the HTML content of the page with Beautiful Soup

In [30]:
# Your Code Here
cutoday_html = bs(webpage, 'html.parser')

### Step 4. Find the elements containing article titles and links, and save them as a list of elements (hint: use class='article-view-mode-sidebar-content')

In [32]:
# Your Code Here
article = cutoday_html.find_all("div",{"class":"article-view-mode-sidebar-content"})
article

[<div class="article-view-mode-sidebar-content node-view-mode-sidebar-content">
 <a href="/today/2024/09/12/discovery-could-lead-longer-lasting-ev-batteries-hasten-energy-transition">Discovery could lead to longer-lasting EV batteries, hasten energy transition</a>
 </div>,
 <div class="article-view-mode-sidebar-content node-view-mode-sidebar-content">
 <a href="/today/2024/09/11/wildfire-smoke-exposure-boosts-risk-mental-illness-youth">Wildfire smoke exposure boosts risk of mental illness in youth</a>
 </div>,
 <div class="article-view-mode-sidebar-content node-view-mode-sidebar-content">
 <a href="/today/2024/09/11/increased-krill-fishing-threatens-whale-comeback">Increased krill fishing threatens whale comeback</a>
 </div>,
 <div class="article-view-mode-sidebar-content node-view-mode-sidebar-content">
 <a href="/today/2024/09/10/lemur-csi-researchers-id-predators-threatening-madagascars-iconic-primates">Lemur CSI: Researchers ID predators threatening Madagascar’s iconic primates</a>

### Step 5. Initialize empty lists to store data (titles, links, content)

In [42]:
# Your Code Here
titles = []
links = []
content = []

for article_details in article:
    title = article_details.find("a").get_text() 
    link = article_details.find("a")['href']
    titles.append(title)
    links.append(link)
    content.append(article_details)
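
The loop above assumes that every matched `div` contains a link, and it stores each block as a Beautiful Soup `Tag` object. A hedged variant is sketched below on a small inline HTML sample: the first block is adapted from the Step 4 output, and the second, link-free block is a hypothetical added for illustration. It skips blocks without an `<a href>` and keeps the content as an HTML string.

```python
from bs4 import BeautifulSoup

# Inline sample so the sketch runs without network access.
sample_html = """
<div class="article-view-mode-sidebar-content node-view-mode-sidebar-content">
<a href="/today/2024/09/11/increased-krill-fishing-threatens-whale-comeback">Increased krill fishing threatens whale comeback</a>
</div>
<div class="article-view-mode-sidebar-content node-view-mode-sidebar-content">
<p>No link in this block</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

records = []
for block in soup.find_all("div", class_="article-view-mode-sidebar-content"):
    anchor = block.find("a")
    # Skip blocks that have no link instead of raising an error.
    if anchor is None or not anchor.has_attr("href"):
        continue
    records.append(
        {
            "Title": anchor.get_text(strip=True),
            "Link": anchor["href"],
            "Content": str(block),  # raw HTML of the block as a plain string
        }
    )

print(records)
```

Storing `str(block)` rather than the `Tag` itself keeps the content column readable when it is later placed in a DataFrame or written to CSV.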

### Step 6. Extract data from the scraped elements.

In [44]:
# Your Code Here
base_url = "https://www.colorado.edu/"
modified_links = [base_url + link for link in links]
print(modified_links)

['https://www.colorado.edu//today/2024/09/12/discovery-could-lead-longer-lasting-ev-batteries-hasten-energy-transition', 'https://www.colorado.edu//today/2024/09/11/wildfire-smoke-exposure-boosts-risk-mental-illness-youth', 'https://www.colorado.edu//today/2024/09/11/increased-krill-fishing-threatens-whale-comeback', 'https://www.colorado.edu//today/2024/09/10/lemur-csi-researchers-id-predators-threatening-madagascars-iconic-primates', 'https://www.colorado.edu//today/cu-boulder-agu-2023', 'https://www.colorado.edu//today/2023/11/28/cu-boulder-cop28-addressing-climate-change-through-innovation', 'https://www.colorado.edu//today/2023/10/10/conflict-middle-east-campus-resources-insights-and-more', 'https://www.colorado.edu//today/gun-violence', 'https://www.colorado.edu//today/2022/quantum-revolution']
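
The printed links contain a doubled slash (`https://www.colorado.edu//today/...`) because `base_url` ends with `/` and each scraped `href` begins with `/`. One way to avoid this, sketched here with the standard library's `urljoin` on two of the paths from the output above, is:

```python
from urllib.parse import urljoin

base_url = "https://www.colorado.edu/"

# Two relative paths copied from the scraped output.
links = [
    "/today/gun-violence",
    "/today/2022/quantum-revolution",
]

# urljoin resolves each path against the base, so no slash is duplicated.
full_links = [urljoin(base_url, link) for link in links]
print(full_links)
```

`urljoin` also leaves already-absolute URLs unchanged, which helps if a page mixes relative and absolute links.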


### Step 7. Create a DataFrame

In [46]:
# Your Code Here
article_details = {'Titles': titles, 'Links': modified_links, 'Content': content}
article_details

{'Titles': ['Discovery could lead to longer-lasting EV batteries, hasten energy transition',
  'Wildfire smoke exposure boosts risk of mental illness in youth',
  'Increased krill fishing threatens whale comeback',
  'Lemur CSI: Researchers ID predators threatening Madagascar’s iconic primates',
  'CU Boulder at AGU 2023: From Earth to space',
  'CU Boulder at COP28: Addressing climate change through innovation',
  'Conflict in the Middle East: Campus resources, insights and more',
  'Gun violence and public health',
  "Colorado's quantum revolution turning state into new Silicon Valley"],
 'Links': ['https://www.colorado.edu//today/2024/09/12/discovery-could-lead-longer-lasting-ev-batteries-hasten-energy-transition',
  'https://www.colorado.edu//today/2024/09/11/wildfire-smoke-exposure-boosts-risk-mental-illness-youth',
  'https://www.colorado.edu//today/2024/09/11/increased-krill-fishing-threatens-whale-comeback',
  'https://www.colorado.edu//today/2024/09/10/lemur-csi-researchers-id

Optional, load content for each page here...
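
A rough sketch of this optional step is shown below. The function names `page_text` and `load_contents` are hypothetical, and `fetch` stands for any callable that returns the HTML of a URL (for example, a wrapper around `urlopen`). A dictionary of fake pages is used here so that the sketch runs offline.

```python
from bs4 import BeautifulSoup


def page_text(html):
    """Return the visible text of an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements, whose text is not page content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)


def load_contents(links, fetch):
    """Fetch every link with `fetch` and return the text of each page."""
    return [page_text(fetch(link)) for link in links]


# Offline stand-in for real pages (hypothetical URL and HTML).
fake_pages = {
    "https://example.org/a": "<html><body><h1>Title</h1><script>x=1</script><p>Body text</p></body></html>",
}
contents = load_contents(list(fake_pages), fake_pages.get)
print(contents)
```

When fetching real pages, pausing briefly between requests is a reasonable courtesy to the server.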

In [48]:
# Your Code Here
import pandas as pd
df_article = pd.DataFrame(article_details)
df_article


| | Titles | Links | Content |
|---|---|---|---|
| 0 | Discovery could lead to longer-lasting EV batt... | https://www.colorado.edu//today/2024/09/12/dis... | [\n, [Discovery could lead to longer-lasting E... |
| 1 | Wildfire smoke exposure boosts risk of mental ... | https://www.colorado.edu//today/2024/09/11/wil... | [\n, [Wildfire smoke exposure boosts risk of m... |
| 2 | Increased krill fishing threatens whale comeback | https://www.colorado.edu//today/2024/09/11/inc... | [\n, [Increased krill fishing threatens whale ... |
| 3 | Lemur CSI: Researchers ID predators threatenin... | https://www.colorado.edu//today/2024/09/10/lem... | [\n, [Lemur CSI: Researchers ID predators thre... |
| 4 | CU Boulder at AGU 2023: From Earth to space | https://www.colorado.edu//today/cu-boulder-agu... | [\n, [CU Boulder at AGU 2023: From Earth to sp... |
| 5 | CU Boulder at COP28: Addressing climate change... | https://www.colorado.edu//today/2023/11/28/cu-... | [\n, [CU Boulder at COP28: Addressing climate ... |
| 6 | Conflict in the Middle East: Campus resources,... | https://www.colorado.edu//today/2023/10/10/con... | [\n, [Conflict in the Middle East: Campus reso... |
| 7 | Gun violence and public health | https://www.colorado.edu//today/gun-violence | [\n, [Gun violence and public health], \n] |
| 8 | Colorado's quantum revolution turning state in... | https://www.colorado.edu//today/2022/quantum-r... | [\n, [Colorado's quantum revolution turning st... |


### Step 8. Save CSV file

In [52]:
# Your Code Here
df_article.to_csv('CUBoulderArticles.csv',index=False)
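
Passing `index=False` keeps the DataFrame's row index out of the file. The sketch below writes to in-memory buffers and uses a made-up two-row frame rather than the scraped data. It illustrates which columns are read back with and without that argument.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_article.
df = pd.DataFrame(
    {"Titles": ["A", "B"], "Links": ["https://example.org/a", "https://example.org/b"]}
)

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Reading the first buffer back yields an extra `Unnamed: 0` column, which is the usual sign that a CSV was saved with its index.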

## Q2 Data Cleaning - Workshop

### This workshop involves cleaning a financial dataset that contains credit card transaction details.



Download the dataset called *transactions.csv* from Canvas. Here's a description of the columns in the dataset:

+ `transactionDateTime`: The date and time of the transaction.
+ `transactionAmount`: The amount of the transaction.
+ `merchantName`: The name of the merchant where the transaction took place.
+ `acqCountry`: The country where the transaction was acquired.
+ `merchantCategoryCode`: The category or type of the merchant.
+ `currentExpDate`: The current expiration date of the credit card.
+ `accountOpenDate`: The account open date.
+ `cardCVV`: The CVV associated with the credit card.
+ `enteredCVV`: The CVV entered during the transaction.
+ `cardLast4Digits`: The last four digits of the credit card number.
+ `transactionType`: The type of transaction.
+ `cardPresent`: Indicates whether the card was present during the transaction.

### Step 1. Import the necessary libraries

In [None]:
# Your Code Here
import pandas as pd



### Step 2. Download the dataset, load it (Google Colab), and save it as a DataFrame.

In [None]:
# Your Code Here


### In this next part, certain columns need cleaning. Each column is listed below along with the task to perform on it.

### Step 3. Capitalize **all column names** for the given dataset

In [None]:
# Your Code Here
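
One possible approach is sketched below, assuming "capitalize" means converting the names to upper case (later text refers to `TRANSACTIONAMOUNT`). The small DataFrame is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in with two of the dataset's columns.
df = pd.DataFrame(
    {"transactionDateTime": ["2016-08-13T14:27:32"], "transactionAmount": [98.55]}
)

# The .str accessor on the column Index applies upper() to every name.
df.columns = df.columns.str.upper()

print(list(df.columns))
```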

### Separate the date and time in **transactionDateTime** by creating new columns

In [None]:
# Your Code Here
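
A hedged sketch of the split is shown below. It assumes that the column names are already upper case and that the timestamps use an ISO-like format. The sample values and the new column names `TRANSACTIONDATE` and `TRANSACTIONTIME` are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real timestamp format may differ.
df = pd.DataFrame(
    {"TRANSACTIONDATETIME": ["2016-08-13T14:27:32", "2016-10-11T05:05:54"]}
)

# Parse the strings; values that cannot be parsed become NaT instead of raising.
parsed = pd.to_datetime(df["TRANSACTIONDATETIME"], errors="coerce")

# Store the date and time components in separate columns.
df["TRANSACTIONDATE"] = parsed.dt.date
df["TRANSACTIONTIME"] = parsed.dt.time

print(df)
```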

### Missing values / Error values

> Check **transactionAmount** for negative values and zeros. If any exist, save them in a local variable called err

In [None]:
# Your Code Here
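
The check could be sketched with a boolean mask, as below. The sample amounts are made up, and the upper-case column name assumes that the renaming in Step 3 has been applied.

```python
import pandas as pd

# Hypothetical amounts, including one negative value and one zero.
df = pd.DataFrame({"TRANSACTIONAMOUNT": [25.0, -10.0, 0.0, 75.0]})

# Keep only the rows whose amount is zero or negative.
err = df[df["TRANSACTIONAMOUNT"] <= 0]

print(len(err))
print(err)
```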

Strategies to replace missing values (TRANSACTIONAMOUNT)
* Strategy 1. Remove them from the dataset. Tip: use **dropna** and create a new DataFrame, Q2_st1
* Strategy 2. Replace them with the mean. Tip: use **fillna** and create a new DataFrame, Q2_st2

#### Which one is better, and why?

In [None]:
#Strategy 1
# Your Code Here


In [None]:
#Strategy 2
# Your Code Here
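
Both strategies are sketched below on made-up data. The sketch assumes that the zero and negative amounts found earlier should first be treated as missing; that is one reading of the task, not something the workshop states.

```python
import pandas as pd

# Hypothetical amounts: one negative, one zero, and one missing value.
df = pd.DataFrame({"TRANSACTIONAMOUNT": [25.0, -10.0, 0.0, None, 75.0]})

# Assumption: non-positive amounts are errors, so mark them as missing (NaN).
cleaned = df.copy()
cleaned["TRANSACTIONAMOUNT"] = cleaned["TRANSACTIONAMOUNT"].where(
    cleaned["TRANSACTIONAMOUNT"] > 0
)

# Strategy 1: drop the rows whose amount is missing.
Q2_st1 = cleaned.dropna(subset=["TRANSACTIONAMOUNT"])

# Strategy 2: replace missing amounts with the mean of the valid ones.
mean_amount = cleaned["TRANSACTIONAMOUNT"].mean()  # NaN values are skipped
Q2_st2 = cleaned.fillna({"TRANSACTIONAMOUNT": mean_amount})

print(len(Q2_st1))
print(Q2_st2["TRANSACTIONAMOUNT"].tolist())
```

Dropping rows shrinks the dataset but leaves the remaining values untouched, while mean imputation keeps every row at the cost of pulling the distribution toward its center.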

