## This project extracts data from multiple sources, transforms it, and loads it into a data storage solution. The subject is "A.I News".

## 1 - Data ingestion:

## ● csv extraction:

### The csv file contains both A.I-related and unrelated articles, so only the relevant rows are extracted

#### Reading csv file:

In [1]:
import pandas as pd
import numpy as np
import re

#Reading the csv file (read_csv already returns a DataFrame, so no extra conversion is needed)
csv_file = pd.read_csv(r"C:/Users/osama/Desktop/Third year - Second semester/Data Engineering/Project/Project Files/articulos_ml.csv")
csv_file

Unnamed: 0,Title,url,Word count,# of Links,# of comments,# Images video,Elapsed days,# Shares
0,What is Machine Learning and how do we use it ...,https://blog.signals.network/what-is-machine-l...,1888,1,2.0,2,34,200000
1,10 Companies Using Machine Learning in Cool Ways,,1742,9,,9,5,25000
2,How Artificial Intelligence Is Revolutionizing...,,962,6,0.0,1,10,42000
3,Dbrain and the Blockchain of Artificial Intell...,,1221,3,,2,68,200000
4,Nasa finds entire solar system filled with eig...,,2039,1,104.0,4,131,200000
...,...,...,...,...,...,...,...,...
156,[Log] 83: How Google Uses Machine Learning And...,[Log] 83: http://feedproxy.google.com/~r/Techc...,3239,3,11.0,1,84,3239
157,[Log] 84: Zuck Knows If You've Been Bad Or Goo...,[Log] 84: http://feedproxy.google.com/~r/Techc...,2566,3,8.0,4,85,25019
158,[Log] 85: Microsoft Improves Windows Phone Voi...,[Log] 85: http://feedproxy.google.com/~r/Techc...,2089,4,4.0,1,86,49614
159,[Log] 86: How Google's Acquisition Of DNNresea...,[Log] 86: http://feedproxy.google.com/~r/Techc...,1530,4,12.0,3,87,33660


In [2]:
#Getting column names from csv file for the new csv dataframe
csv_columns = csv_file.columns

#Making new dataframe
csv_df = pd.DataFrame(columns = csv_columns)
csv_df

Unnamed: 0,Title,url,Word count,# of Links,# of comments,# Images video,Elapsed days,# Shares


#### Cleaning and transforming data:

In [3]:
#A title is chosen for the new dataframe if it contains any of these terms
#Note: this is a plain substring check, so "AI" also matches inside longer words (for example "PythonAI")
matches = ["Artificial Intelligence", "AI", "A.I", "A.I."]

#Regular expression matching "[Log]" followed by a one- or two-digit number and a colon
#Note: some titles and urls in the csv file start with "[Log] ##:" (for example "[Log] 23:"), so this pattern removes that prefix
pattern = r"\[Log\]\s*\d{1,2}:"

#Collect the rows whose title contains one of the terms
#Note: DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list and converted once
selected_rows = []
for index, row in csv_file.iterrows():                 #Iterate over each row
    title = row["Title"]
    if pd.isna(title) or pd.isna(row["url"]):          #Skip rows with a missing title or url
        continue
    if any(x in title for x in matches):               #Search the title for the terms
        row = row.copy()                               #Work on a copy so csv_file is left unchanged
        row["Title"] = re.sub(pattern, "", title).strip()           #Remove the "[Log] ##:" prefix from the title
        row["url"] = re.sub(pattern, "", str(row["url"])).strip()   #Remove the "[Log] ##:" prefix from the url
        selected_rows.append(row)

csv_df = pd.DataFrame(selected_rows, columns=csv_columns)

#Dropping rows that still have a missing value in any column
csv_df = csv_df.dropna()

#The "# of comments" column is float64; convert it to int to match the other numerical columns
csv_df['# of comments'] = csv_df['# of comments'].astype(int)

csv_df

Unnamed: 0,Title,url,Word count,# of Links,# of comments,# Images video,Elapsed days,# Shares
17,Who’s a good AI? Dog-based data creates a cani...,https://techcrunch.com/2018/04/11/whos-a-good-...,635,3,1,2,12,3200
44,Top 20 PythonAI and MachineLearning Open Sourc...,https://www.kdnuggets.com/2018/02/top-20-pytho...,1184,39,8,1,63,1300
73,Allegro.AI nabs $11M for ‘deep learning as a ...,https://techcrunch.com/2018/04/25/allegro-ai-...,1864,6,12,2,1,42406
75,UK report urges action to combat AI bias,https://techcrunch.com/2018/04/16/uk-report-u...,1741,5,10,3,3,35691
78,Arm chips with Nvidia AI could change the Int...,https://techcrunch.com/2018/03/27/arm-chips-w...,1864,1,10,4,6,30756
87,Frank Chen will make you a believer in AI,https://mixpanel.com/blog/2017/12/12/frank-ch...,1913,5,1,6,15,5261
99,Frank Lessons on AI from the Developer of the...,https://mxpnlcms.wpengine.com/blog/2017/08/31...,1007,2,7,6,27,10574
101,Prisma shifts focus to b2b with an API for AI...,https://techcrunch.com/2017/08/19/prisma-shif...,3019,5,3,2,29,13586
104,What you should know about AI,https://techcrunch.com/2017/08/01/what-you-sh...,2224,8,2,3,32,32248
105,HBO’s Silicon Valley on A.I. driven products:...,https://mxpnlcms.wpengine.com/blog/2017/07/27...,1778,4,11,6,33,8001
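
The same filtering can also be expressed with vectorized pandas string methods instead of a row-by-row loop. The following is a minimal sketch, not the notebook's own code: the `sample` rows and the `example.com` urls are made up for illustration, and the term list and "[Log]" pattern are copied from the cell above.

```python
import re
import pandas as pd

# Terms and prefix pattern taken from the cell above
matches = ["Artificial Intelligence", "AI", "A.I", "A.I."]
log_prefix = r"\[Log\]\s*\d{1,2}:\s*"

# Illustrative rows only (not taken from articulos_ml.csv)
sample = pd.DataFrame({
    "Title": [
        "[Log] 83: How Google Uses Machine Learning",
        "[Log] 12: What you should know about AI",
        "UK report urges action to combat AI bias",
        None,
    ],
    "url": [
        "[Log] 83: http://example.com/a",
        "[Log] 12: http://example.com/b",
        None,
        "http://example.com/d",
    ],
})

cleaned = sample.copy()
for column in ("Title", "url"):
    # Strip the "[Log] ##:" prefix; missing values stay missing
    cleaned[column] = cleaned[column].str.replace(log_prefix, "", regex=True)

# Build one alternation pattern from the terms (same substring semantics as the loop)
terms_pattern = "|".join(re.escape(term) for term in matches)
is_ai = cleaned["Title"].str.contains(terms_pattern, regex=True, na=False)

# Keep matching rows that have both a title and a url
ai_rows = cleaned[is_ai].dropna(subset=["Title", "url"])
print(ai_rows)
```

Passing `na=False` treats a missing title as "no match", which avoids the `TypeError` that a plain `in` check raises on `NaN`.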


## ● web scraping extraction:

In [4]:
from bs4 import BeautifulSoup 
import requests

#AI news dataframe
AiNews = pd.DataFrame(columns = ['Title','Description','Date','Genre', 'url'])
AiNews

Unnamed: 0,Title,Description,Date,Genre,url


In [5]:
#Temporary containers for the attributes, once all data is collected they're added into the AiNews dataframe
AiNews_Title = []
AiNews_Description = []
AiNews_Date = []
AiNews_Genre = []
AiNews_url = []

#### Extraction functions:

In [6]:
#Creating a BeautifulSoup object to use in extraction
url = "https://www.artificialintelligence-news.com/"  #Website link
response = requests.get(url)                          #Get website
soup = BeautifulSoup(response.content, 'html.parser') #read website html content

def AiNews_Title_extraction():
    titles = soup.select('header.article-header') #Title data tag
    for x in titles:                              #Get all news titles
        text = x.get_text().strip()               #Returns the text as a string, without any tags or markup
        AiNews_Title.append(text)
        
        
def AiNews_Description_extraction():
    descriptions = soup.select('div.cell.small-12.medium-8.large-6') #Description data tag
    for x in descriptions:                                           #Get all news descriptions
        text = x.get_text().strip()                                  #Returns the text as a string, without any tags or markup
        AiNews_Description.append(text)
        
        
def AiNews_Date_and_AiNews_Genre_extraction(): 
    extracted = soup.select('div.byline')                        #Tag which contained both 'Date' and 'Genre' data
    for x in extracted:
        text = x.get_text().strip()                              #Returns the text as a string, without any tags or markup
        text = text.split('                    |\n            ') #Split on the separator between 'Date' and 'Genre'
        if len(text) != 2:                                       #Skip bylines that do not contain exactly a date and a genre
            continue
        date, genre = text                                       #Storing 'Date' and 'Genre' data into different variables

        #Add the date and genre to the temporary containers
        AiNews_Date.append(date)
        AiNews_Genre.append(genre)
        
        
def AiNews_url_extraction():
    links = soup.select('header.article-header') #the header contains the link
    for x in links:
        link = x.find('a')                       #reaching the <a> tag to extract the link
        link = link['href']                      #extracting the url
        AiNews_url.append(link)
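
Splitting the byline on an exact run of spaces depends on the page's current whitespace. A more tolerant split could look like the sketch below. The helper name `split_byline` is hypothetical, and the sample byline text is an assumption modelled on the separator used above, not text copied from the website.

```python
import re

def split_byline(text):
    """Split 'date | genre' text into a (date, genre) tuple, or return None if malformed."""
    # Split on the pipe character, ignoring any whitespace or newlines around it
    parts = [part.strip() for part in re.split(r"\s*\|\s*", text.strip())]
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]

# Assumed byline text, shaped like the separator in the extraction function
print(split_byline("9 June 2023                    |\n            Applications"))
print(split_byline("9 June 2023"))
```

Returning `None` for a malformed byline lets the caller decide whether to skip the article or store a missing value.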

#### Refresh function:

In [7]:
def AiNews_Refresh():
    global AiNews, soup
    #Download and parse the page again so that each refresh sees the latest articles
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    #Empty the temporary containers so that earlier results are not appended again
    for container in (AiNews_Title, AiNews_Description, AiNews_Date, AiNews_Genre, AiNews_url):
        container.clear()

    #Append all extracted data to temporary containers
    AiNews_Title_extraction()
    AiNews_Description_extraction()
    AiNews_Date_and_AiNews_Genre_extraction()
    AiNews_url_extraction()

    #Convert the temporary container lists into series of strings
    title_series = pd.Series(AiNews_Title, dtype='str')
    description_series = pd.Series(AiNews_Description, dtype='str')
    date_series = pd.Series(AiNews_Date, dtype='str')
    genre_series = pd.Series(AiNews_Genre, dtype='str')
    url_series = pd.Series(AiNews_url, dtype='str')

    #Combine the series in a dictionary to turn into a dataframe
    new_data = {                              #Series of different lengths are aligned on their index; missing entries become NaN
        'Title': title_series,
        'Description': description_series,
        'Date': date_series,
        'Genre': genre_series,
        'url': url_series
    }
    
    new_df = pd.DataFrame(new_data)          #Making a dataframe with the dictionary
    AiNews = pd.concat([AiNews, new_df], ignore_index=True)   #Concatenate the new_df dataframe with AiNews

    #Dropping duplicates and NaN
    AiNews.drop_duplicates(subset=['Title'], inplace=True)
    AiNews = AiNews.dropna()
    display(AiNews)

#### Refresh data every 2 hours to get latest news:

In [8]:
import time
import threading

#Define the interval in seconds (120 minutes = 120 * 60 seconds)
interval = 120 * 60

#Define a function to run AiNews_Refresh at the specified interval
def run_AiNews_Refresh():
    while True:
        #Call the AiNews_Refresh function
        AiNews_Refresh()
        
        #Wait for the specified interval
        time.sleep(interval)

#Start a background thread to run the function
thread = threading.Thread(target=run_AiNews_Refresh)
thread.daemon = True
thread.start()

Unnamed: 0,Title,Description,Date,Genre,url
0,"Steve Frederickson, Lucy.ai: How AI powers a n...",In an interview at AI & Big Data Expo with Ste...,9 June 2023,Applications,https://www.artificialintelligence-news.com/20...
1,Meta’s open-source speech AI models support ov...,Advancements in machine learning and speech re...,8 June 2023,Artificial Intelligence,https://www.artificialintelligence-news.com/20...
2,Beijing launches campaign against AI-generated...,China's Cyberspace Administration (CAC) has la...,6 June 2023,Applications,https://www.artificialintelligence-news.com/20...
3,SAP taps Microsoft’s generative AI technologies,SAP and Microsoft have announced a new collabo...,31 May 2023,Applications,https://www.artificialintelligence-news.com/20...
4,OpenAI CEO: AI regulation ‘is essential’,OpenAI CEO Sam Altman testified in front of a ...,25 May 2023,Applications,https://www.artificialintelligence-news.com/20...
5,"Jay Migliaccio, IBM Watson: On leveraging AI t...",IBM has been refining its AI solutions for dec...,23 May 2023,Applications,https://www.artificialintelligence-news.com/20...
6,"Iurii Milovanov, SoftServe: How AI/ML is helpi...",Could you tell us a little bit about SoftServe...,18 May 2023,Applications,https://www.artificialintelligence-news.com/20...
7,AI and Big Data Expo North America begins in l...,The AI and Big Data Expo North is taking place...,17 May 2023,Applications,https://www.artificialintelligence-news.com/20...
8,EU committees green-light the AI Act,The Internal Market Committee and the Civil Li...,16 May 2023,Artificial Intelligence,https://www.artificialintelligence-news.com/20...
9,Wozniak warns AI will power next-gen scams,Apple co-founder Steve Wozniak has raised conc...,15 May 2023,Applications,https://www.artificialintelligence-news.com/20...
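
The `while True` loop above can only be stopped by ending the notebook kernel. One possible alternative is to wait on a `threading.Event`, which lets the loop be stopped on request. This is a minimal sketch: `run_periodically`, `stop_event` and `demo_task` are hypothetical names, and the short interval is only there to make the demonstration finish quickly.

```python
import threading

def run_periodically(task, interval_seconds, stop_event):
    """Call task repeatedly, waiting interval_seconds between calls, until stop_event is set."""
    while not stop_event.is_set():
        task()
        # wait() returns early as soon as the event is set, unlike time.sleep()
        stop_event.wait(interval_seconds)

# Demonstration with a stand-in task that stops the loop after three calls
calls = []
stop_event = threading.Event()

def demo_task():
    calls.append(len(calls) + 1)
    if len(calls) >= 3:
        stop_event.set()

worker = threading.Thread(target=run_periodically, args=(demo_task, 0.01, stop_event), daemon=True)
worker.start()
worker.join()
print(calls)
```

In the notebook, `demo_task` would be replaced by `AiNews_Refresh` and the interval by `120 * 60`; calling `stop_event.set()` from another cell would then end the refresh loop.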


## ● PDF Extraction:

In [9]:
import PyPDF2

#Function to search for certain keywords inside the pdf file
def search_pdf_for_word(pdf_path, keywords):
    matching_paragraphs = [] #list to store the matching paragraphs
    
    with open(pdf_path, 'rb') as file:          #Open the pdf file
        pdf_reader = PyPDF2.PdfReader(file)     #Read the pdf file
        total_pages = len(pdf_reader.pages)     #Get number of pages
        
        #Loop over the pages and extract the text of each one
        for page_num in range(2, total_pages):  #page indices start at 0, so this skips the first two pages (index and introduction)
            
            page = pdf_reader.pages[page_num]   #get the current page
            text = page.extract_text()          #extract text from the page
            
            #Split the text on newline characters
            #Note: the extracted text usually has a newline after every line, so each "paragraph" here is a single line of text
            paragraphs = text.split('\n')       
            
            #Search for lines containing all the keywords (case-insensitive)
            #Note: this is a substring check, so "ai" also matches inside words such as "mainly"
            for paragraph in paragraphs:
                if all(keyword.lower() in paragraph.lower() for keyword in keywords):#check that every keyword occurs in the line
                    matching_paragraphs.append(paragraph)                            #append matching line
                    
    return matching_paragraphs

#### The pdf file is about AI in general, but we want the passages about AI news specifically, so the keywords are "news" and "ai"

In [10]:
pdf_path = "C:/Users/osama/Desktop/Third year - Second semester/Data Engineering/Project/Project Files/AI in the News.pdf"
keywords = ['news', 'ai']

pdf_result = search_pdf_for_word(pdf_path, keywords)
pdf_result = pd.DataFrame(pdf_result, columns=["Matching results"]) #saving results in a dataframe
pdf_result

Unnamed: 0,Matching results
0,adoption and use of AI and re-examines questio...
1,"the US and Europe, I argue that the introducti..."
2,mainly flown from their control over the chann...
3,connection in the news . With the complexity a...
4,"lock-in effects, news organisations will likel..."
...,...
61,the latter into their orbit. Platform companie...
62,"a development. Despite its flaws, AI potential..."
63,"Domingo, David. 2008. “Interactivity in the Da..."
64,"Xu, Craig. 2021. ‘Australia: The News Media Ba..."
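
Because the search is a substring check, a line such as "mainly flown from their control..." can match the keyword "ai" through the word "mainly". A stricter, whole-word check could be sketched as follows. The helper name `contains_all_keywords` is hypothetical, and the sample sentences are illustrative rather than quoted from the pdf.

```python
import re

def contains_all_keywords(text, keywords):
    """Return True if every keyword occurs in text as a whole word (case-insensitive)."""
    return all(
        # \b marks a word boundary, so "ai" no longer matches inside "mainly"
        re.search(r"\b" + re.escape(keyword) + r"\b", text, flags=re.IGNORECASE)
        for keyword in keywords
    )

print(contains_all_keywords("AI is reshaping the news.", ["news", "ai"]))
print(contains_all_keywords("mainly flown from their control over news channels", ["news", "ai"]))
```

The trade-off is that whole-word matching also rejects compounds such as "newsroom", so which check is appropriate depends on how broad the search should be.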


## 2 - Data Storage:

#### Storing data in MongoDB:

In [11]:
from pymongo import MongoClient

#Convert the dataframes to lists of dictionaries (one per row), because MongoDB stores JSON-like documents and insert_many expects a list of dictionaries
csv_dict = csv_df.to_dict(orient='records')
AiNews_dict = AiNews.to_dict(orient='records')
pdf_result_dict = pdf_result.to_dict(orient='records')

#Define the sections and the corresponding data which will be saved in it
#sections_data = {
#    'CSV Extraction Data': csv_dict,
#    'Web Scraping Data': AiNews_dict,
#    'PDF Extraction Data': pdf_result_dict
#}

In [12]:
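#Connect to your MongoDB database:
#Note: the connection string contains the username and password, so it is read from an environment variable instead of being hard-coded
#(the variable name MONGODB_URI is an assumption; use whichever name you have set)
import os
client = MongoClient(os.environ['MONGODB_URI'])
db = client['Data_Storage']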

#Define a collection where you want to store your data:
csv_collection = db['CSV data']
WebScraping_collection = db['WebScraping data']
pdf_collection = db['PDF data']

#Insert the data into separate collections:
csv_collection.insert_many(csv_dict)
WebScraping_collection.insert_many(AiNews_dict)
pdf_collection.insert_many(pdf_result_dict)

#Close the MongoDB connection
client.close()