# Data Gathering

## Sources of Data

A vast amount of historical data can be found in files such as:
- MS Word documents
- Emails
- Spreadsheets
- MS PowerPoint presentations
- PDFs
- HTML
- Plaintext files

Public and private archives are another source of historical data.

CSV, JSON, and XML files are plaintext formats, so they are compatible with a wide range of applications.

The Web can be mined for data using a web scraping application.

The IoT uses sensors to create data. Sensors in smartphones, cars, airplanes, street lamps, and home appliances capture raw data.
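
Because CSV, JSON, and XML are plaintext formats, Python's standard library can read all three. The following is a minimal sketch; the sample strings and field names (`sensor`, `reading`) are made up for illustration and stand in for real files.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Illustrative sample data (assumed, not taken from a real source)
csv_text = "sensor,reading\ntemp,21.5\nhumidity,40\n"
json_text = '{"sensor": "temp", "reading": 21.5}'
xml_text = "<record><sensor>temp</sensor><reading>21.5</reading></record>"

# CSV: each row becomes a dict keyed by the header line (values stay strings)
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed directly into Python dicts, lists, numbers, and strings
json_record = json.loads(json_text)

# XML: walk the child elements of the root and collect tag/text pairs
xml_root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in xml_root}

print(csv_rows)
print(json_record)
print(xml_record)
```

Note that CSV and XML values arrive as text, while JSON keeps numeric types, so some conversion is usually needed before analysis.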

## Open Data and Private Data

1. Open Data
   - The Open Knowledge Foundation describes Open Data as “any content, information or data that people are free to use, reuse, and redistribute without any legal, technological, or social restriction.”
2. Private Data
   - Data that carries an expectation of privacy and is regulated by a particular country or government

## Structured and Unstructured Data

1. Structured Data
   - Data entered and maintained in fixed fields within a file or record
   - Easily entered, classified, queried, and analyzed
   - Examples: relational databases and spreadsheets
2. Unstructured Data
   - Raw data that lacks organization
   - Examples: photo contents, audio, video, web pages, blogs, books, journals, white papers, PowerPoint presentations, articles, email, wikis, word processing documents, and text in general
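
The difference can be sketched in a few lines of Python. The records and the review sentence below are illustrative assumptions: structured data can be queried by field name right away, while values buried in free text have to be extracted first (here with regular expressions).

```python
import re

# Structured data: every record has the same named fields, so it can be queried directly.
movies = [
    {"name": "Logan", "year": 2017, "imdb": 8.1},
    {"name": "Dunkirk", "year": 2017, "imdb": 7.8},
]
high_rated = [m["name"] for m in movies if m["imdb"] >= 8.0]

# Unstructured data: free text has no fields, so values must be extracted first.
# The sentence is a made-up example.
review = "I watched Logan (2017) last night and would rate it 8.1 out of 10."
year_match = re.search(r"\((\d{4})\)", review)
rating_match = re.search(r"(\d+\.\d+) out of 10", review)
extracted = {
    "year": int(year_match.group(1)),
    "imdb": float(rating_match.group(1)),
}

print(high_rated)
print(extracted)
```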

## Example of gathering image data using a webcam

In [1]:
import cv2

webcam = cv2.VideoCapture(0)  # open the default camera
while True:
    try:
        check, frame = webcam.read()
        if not check:  # no frame was delivered (camera missing or busy)
            print("Could not read a frame from the webcam.")
            webcam.release()
            cv2.destroyAllWindows()
            break
        print(check)  # prints True as long as the webcam is running
        print(frame)  # prints the matrix values of each frame
        cv2.imshow("Capturing", frame)
        key = cv2.waitKey(1)
        if key == ord('s'):
            # Save the current frame, then show a grayscale preview of it
            cv2.imwrite(filename='saved_img.jpg', img=frame)
            webcam.release()
            img_new = cv2.imread('saved_img.jpg', cv2.IMREAD_GRAYSCALE)
            cv2.imshow("Captured Image", img_new)  # imshow returns nothing, so do not reassign img_new
            cv2.waitKey(1650)
            cv2.destroyAllWindows()
            print("Processing image...")
            img_ = cv2.imread('saved_img.jpg', cv2.IMREAD_ANYCOLOR)
            print("Converting RGB image to grayscale...")
            gray = cv2.cvtColor(img_, cv2.COLOR_BGR2GRAY)  # OpenCV loads images in BGR channel order
            print("Converted RGB image to grayscale...")
            print("Resizing image to 28x28 scale...")
            img_ = cv2.resize(gray, (28, 28))
            print("Resized...")
            cv2.imwrite(filename='saved_img-final.jpg', img=img_)
            print("Image saved!")
            break
        elif key == ord('q'):
            print("Turning off camera.")
            webcam.release()
            print("Camera off.")
            print("Program ended.")
            cv2.destroyAllWindows()
            break
    except KeyboardInterrupt:
        print("Turning off camera.")
        webcam.release()
        print("Camera off.")
        print("Program ended.")
        cv2.destroyAllWindows()
        break



True
[[[230 224 213]
  [229 223 212]
  [229 221 210]
  ...
  [148 185 155]
  [145 182 153]
  [143 180 151]]

 [[229 223 211]
  [228 222 210]
  [224 218 206]
  ...
  [146 185 150]
  [143 181 148]
  [145 183 150]]

 [[231 226 206]
  [231 226 206]
  [225 219 202]
  ...
  [146 188 152]
  [143 186 147]
  [144 188 146]]

 ...

 [[ 86  54  47]
  [ 86  54  47]
  [ 86  54  47]
  ...
  [ 47  44  45]
  [ 49  44  44]
  [ 49  44  43]]

 [[ 87  55  48]
  [ 87  55  48]
  [ 84  54  46]
  ...
  [ 47  44  45]
  [ 46  46  45]
  [ 45  45  43]]

 [[ 81  56  48]
  [ 80  55  47]
  [ 78  53  45]
  ...
  [ 45  45  45]
  [ 45  45  44]
  [ 43  43  41]]]
True
[[[239 222 216]
  [231 214 208]
  [233 217 208]
  ...
  [156 183 158]
  [161 184 157]
  [164 184 158]]

 [[235 224 213]
  [234 221 210]
  [232 216 206]
  ...
  [156 183 158]
  [157 185 157]
  [160 188 159]]

 [[243 227 217]
  [242 226 216]
  [231 215 205]
  ...
  [157 185 157]
  [154 184 155]
  [151 181 152]]

 ...

 [[ 83  57  52]
  [ 81  55  50]
  [ 79  55

## Example of gathering voice data using a microphone

In [2]:
# import required libraries
import sounddevice as sd
from scipy.io.wavfile import write
import wavio as wv
# Sampling frequency
freq = 44100
# Recording duration
duration = 5
# Start recording with the given duration and sampling frequency
recording = sd.rec(int(duration * freq),
                   samplerate=freq, channels=2)
# Block until the recording has finished
sd.wait()
# Save the NumPy array as a WAV file using SciPy
write("recording0.wav", freq, recording)
# Save the same array with wavio, using 2 bytes (16 bits) per sample
wv.write("recording1.wav", recording, freq, sampwidth=2)



In [3]:
!ls

dataGathering.ipynb  Pictures  recording0.wav  recording1.wav
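
To check what ended up in a WAV file, its header can be read back with the standard `wave` module. The sketch below does not need a microphone: it writes a synthetic 440 Hz tone to a temporary file (the file name `demo_tone.wav`, the tone, and the one-second duration are assumptions for illustration) and then reads the properties back. The `wave` module reads integer PCM files, which is what `sampwidth=2` should produce for `recording1.wav`.

```python
import math
import os
import struct
import tempfile
import wave

freq = 44100   # sampling frequency in Hz, as in the recording cell
duration = 1   # seconds of synthetic audio (assumed, to keep the demo short)

# Build a 440 Hz sine tone as 16-bit mono samples (a stand-in for microphone input)
samples = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / freq))
           for n in range(freq * duration)]

path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
with wave.open(path, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)   # 2 bytes = 16 bits per sample
    wav_out.setframerate(freq)
    wav_out.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Read the file back and report its basic properties
with wave.open(path, "rb") as wav_in:
    n_channels = wav_in.getnchannels()
    sample_width = wav_in.getsampwidth()
    frame_rate = wav_in.getframerate()
    n_frames = wav_in.getnframes()

seconds = n_frames / frame_rate
print(n_channels, sample_width, frame_rate, seconds)
```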


## Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World
Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated
processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or
spreadsheet, for later retrieval or analysis.
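
The idea of copying specific data from a page into a spreadsheet-friendly form can be sketched without any network access. The example below parses an inline HTML string (hypothetical content standing in for a downloaded page) with the standard library's `html.parser` and writes the extracted cells as CSV text. The class name `CellCollector` is introduced here for illustration only.

```python
import csv
import io
from html.parser import HTMLParser

# Inline HTML standing in for a downloaded page (hypothetical content)
PAGE = """
<table>
  <tr><td>Logan</td><td>8.1</td></tr>
  <tr><td>Dunkirk</td><td>7.8</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:             # ignore whitespace between tags
            self.rows[-1].append(data.strip())

collector = CellCollector()
collector.feed(PAGE)

# Copy the extracted cells into CSV text, ready for a spreadsheet
buffer = io.StringIO()
csv.writer(buffer).writerows(collector.rows)
csv_text = buffer.getvalue()
print(csv_text)
```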

## Image Scraping using BeautifulSoup and Requests

In [4]:
import requests
from bs4 import BeautifulSoup
def getdata(url):
    r = requests.get(url)
    return r.text
    
htmldata = getdata("https://www.google.com/")
soup = BeautifulSoup(htmldata, 'html.parser')
for item in soup.find_all('img'):
    src = item.get('src')  # returns None instead of raising KeyError when src is missing
    if src:
        print(src)

/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png
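
The `src` printed above is a relative path, so it has to be combined with the page address before the image can be downloaded. A minimal sketch using `urllib.parse.urljoin` from the standard library; the second URL (`example.com`) is a placeholder to show that absolute addresses pass through unchanged.

```python
from urllib.parse import urljoin

page_url = "https://www.google.com/"
src = "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"

# Resolve the relative path against the page it was found on
absolute_url = urljoin(page_url, src)
print(absolute_url)

# An src that is already absolute is returned as-is (placeholder URL)
unchanged = urljoin(page_url, "https://example.com/a.png")
print(unchanged)
```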


## Image Scraping using Selenium

In [8]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
import time
import requests
import shutil
import os
import getpass
import urllib.request
import io
from PIL import Image

In [16]:
user = getpass.getuser()
firefox_service = Service(executable_path = '/usr/bin/geckodriver')
firefox_options = webdriver.FirefoxOptions()
# firefox_options.add_argument('--headless')#
# firefox_options.add_argument('--no-sandbox')#
# firefox_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Firefox(service=firefox_service, options=firefox_options)
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)#sleep_between_interactions

    
def getImageUrls(name,totalImgs,driver):
    # Build the image-search URL from the search term instead of hard-coding "cat";
    # the tracking parameters of the original URL are dropped
    search_url = "https://www.google.com/search?q=" + urllib.parse.quote_plus(name) + "&tbm=isch&hl=en"
    driver.get(search_url)
    img_urls = set()
    img_count = 0
    results_start = 0

    while img_count < totalImgs: # keep going until enough unique URLs are collected
        scroll_to_end(driver)
        totalResults = driver.find_elements(By.CLASS_NAME,"Q4LuWd")
        print('total results:', len(totalResults))
        print(f"Found: {len(totalResults)} search results. Extracting links from{results_start}:{len(totalResults)}")
        if len(totalResults) <= results_start:
            break # no new thumbnails were loaded, so stop instead of looping forever
        for img in totalResults[results_start:]:
            img.click()
            time.sleep(5)
            image = driver.find_element(By.CLASS_NAME,'iPVvYb')
            img_urls.add(image.get_attribute('src'))
            img_count=len(img_urls)
            if img_count >= totalImgs:
                break
        results_start = len(totalResults) # skip thumbnails that were already processed

    return img_urls

def downloadImages(folder_path,file_name,url):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        return  # nothing to save if the download failed
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")
        
def saveInDestFolder(searchNames,destDir,totalImgs,driver):
    for name in list(searchNames):
        path=os.path.join(destDir,name)
        os.makedirs(path, exist_ok=True)  # create the folder if it does not exist yet
        print('Current Path',path)
        totalLinks=getImageUrls(name,totalImgs,driver)
        print('totalLinks',totalLinks)

        # Download inside the loop so every search term is saved, not only the last one
        if not totalLinks:
            print('images not found for :',name)
        else:
            for i, link in enumerate(totalLinks):
                file_name = f"{i:03d}.jpg"  # zero-padded index such as 000.jpg (the original width of 150 padded the name with spaces)
                downloadImages(path,file_name,link)
            
searchNames=['cat']
destDir='/home/kurtymittens/Jupyter/Data Gathering/Pictures'
totalImgs=5

saveInDestFolder(searchNames,destDir,totalImgs,driver)

Current Path /home/kurtymittens/Jupyter/Data Gathering/Pictures/cat
total results: 100
Found: 100 search results. Extracting links from0:100
totalLinks {'https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_square.jpg', 'https://cdn.britannica.com/70/234870-050-D4D024BB/Orange-colored-cat-yawns-displaying-teeth.jpg', 'https://cdn.britannica.com/34/235834-050-C5843610/two-different-breeds-of-cats-side-by-side-outdoors-in-the-garden.jpg', 'https://media.4-paws.org/5/b/4/b/5b4b5a91dd9443fa1785ee7fca66850e06dcc7f9/VIER%20PFOTEN_2019-12-13_209-2890x2000-1920x1329.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Cat_August_2010-4.jpg/1200px-Cat_August_2010-4.jpg'}
SAVED - https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_square.jpg - AT: /home/kurtymittens/Jupyter/Data Gathering/Pictures/cat/                                                                                                                

## Web Scraping of Movies Information using BeautifulSoup
We want to analyze the distributions of IMDb and Metacritic movie ratings to see if we find anything interesting. To do this, we'll first scrape data for over 2000 movies.

In [1]:
from requests import get
url = 'https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
agent = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
        "referer":"https://www.imdb.com/"}
response = get(url, headers=agent)
print(response.text[:500])

<!DOCTYPE html><html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
             


## Using BeautifulSoup to parse the HTML content

To parse our HTML document and extract the 50 movie containers, we'll use a Python module called BeautifulSoup, one of the most common web scraping modules for Python.
In the following code cell we will:
- Import the BeautifulSoup class from the package bs4.
- Parse response.text by creating a BeautifulSoup object, and assign this object to html_soup. The 'html.parser' argument indicates that we want to do the parsing using Python's built-in HTML parser.

In [2]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
headers = {'Accept-Language': 'en-US,en;q=0.8'}
type(html_soup)

bs4.BeautifulSoup

Before extracting the 50 movie containers, we need to figure out what distinguishes them from other elements on the page. Often, the distinctive mark resides in the class attribute. Older versions of the IMDb results page wrapped each movie in a div whose class attribute had two values: lister-item and mode-advanced. The page structure has since changed, and each movie now sits in an li element with the class ipc-metadata-list-summary-item. A quick search of the page source (Ctrl + F) should confirm that this class marks the movie containers. We have 50 such containers, so we expect 50 matches.

Now let's use the find_all() method to extract all the li containers that have a class attribute of ipc-metadata-list-summary-item.

In [3]:
# The IMDb page structure has changed, so we select li elements instead of the older div containers
movie_containers = html_soup.find_all('li', class_ = 'ipc-metadata-list-summary-item')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


find_all() returned a ResultSet object, which is a list containing all 50 li containers we are interested in.

Now we'll select only the first container and extract, in turn, each item of interest:
- The name of the movie.
- The year of release.
- The IMDB rating.
- The Metascore.
- The number of votes.

In [4]:
first_movie = movie_containers[0]
print(first_movie)

<li class="ipc-metadata-list-summary-item"><div class="ipc-metadata-list-summary-item__c"><div class="ipc-metadata-list-summary-item__tc"><span aria-disabled="false" class="ipc-metadata-list-summary-item__t"></span><div class="sc-ab6fa25a-3 bVYfLY dli-parent"><div class="sc-ab6fa25a-2 gOsifL"><div class="sc-e5a25b0f-0 jQjDIb dli-poster-container"><div class="ipc-poster ipc-poster--base ipc-poster--dynamic-width ipc-sub-grid-item ipc-sub-grid-item--span-2" role="group"><div aria-label="add to watchlist" class="ipc-watchlist-ribbon ipc-focusable ipc-watchlist-ribbon--s ipc-watchlist-ribbon--base ipc-watchlist-ribbon--loading ipc-watchlist-ribbon--onImage ipc-poster__watchlist-ribbon" role="button" tabindex="0"><svg class="ipc-watchlist-ribbon__bg" height="34px" role="presentation" viewbox="0 0 24 34" width="24px" xmlns="http://www.w3.org/2000/svg"><polygon class="ipc-watchlist-ribbon__bg-ribbon" fill="#000000" points="24 0 0 0 0 32 12.2436611 26.2926049 24 31.7728343"></polygon><polygon 

In [5]:
first_movie.div.div.div.find_all('div', class_ = "sc-b0691f29-0 jbYPfh") # all needed things to this activity

[<div class="sc-b0691f29-0 jbYPfh"><div class="ipc-title ipc-title--base ipc-title--title ipc-title-link-no-icon ipc-title--on-textPrimary sc-b0691f29-9 klOwFB dli-title"><a class="ipc-title-link-wrapper" href="/title/tt3315342/?ref_=sr_t_1" tabindex="0"><h3 class="ipc-title__text">1. Logan</h3></a></div><div class="sc-b0691f29-7 hrgukm dli-title-metadata"><span class="sc-b0691f29-8 ilsLEX dli-title-metadata-item">2017</span><span class="sc-b0691f29-8 ilsLEX dli-title-metadata-item">2h 17m</span><span class="sc-b0691f29-8 ilsLEX dli-title-metadata-item">R-16</span></div><span class="sc-b0691f29-1 grHDBY"><div class="sc-e2dbc1a3-0 ajrIH sc-b0691f29-2 bhhtyj dli-ratings-container" data-testid="ratingGroup--container"><span aria-label="IMDb rating: 8.1" class="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating" data-testid="ratingGroup--imdb-rating"><svg class="ipc-icon ipc-icon--star-inline" fill="currentColor" height="24" role="presentation" viewbox="0 

## Name

In [6]:
# Drop the leading rank ("1. ") and keep the full title, including multi-word titles
first_movie.find('h3', class_="ipc-title__text").text.split('. ', 1)[1]

'Logan'
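
Titles on the results page carry a rank prefix such as `1. `. A regular expression is one way to strip it regardless of how many digits the rank has. This is a sketch: `strip_rank` is a helper name introduced here, and the ranks in the sample strings are made up for illustration.

```python
import re

def strip_rank(title_text):
    """Remove a leading rank such as '1. ' or '25. ' from a title string."""
    return re.sub(r"^\d+\.\s+", "", title_text.strip())

# Sample strings with illustrative ranks
samples = ["1. Logan", "25. Mr. & Mrs. Smith", "  7. Blade Runner 2049"]
cleaned = [strip_rank(s) for s in samples]
print(cleaned)
```

Only the leading number and dot are removed, so a title that itself contains a dot, such as "Mr. & Mrs. Smith", stays intact.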

## Year of the Movie

In [7]:
first_year = first_movie.find("span", class_="sc-b0691f29-8 ilsLEX dli-title-metadata-item")
first_year.text
#Year of the First Movie

'2017'

## IMDB rating

In [8]:
first_rating = first_movie.find("span",class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating").text
float(first_rating[:3])

8.1

## Metacritic

In [9]:
first_metacritic = first_movie.find("span", class_="sc-b0901df4-0 bcQdDJ metacritic-score-box").text
int(first_metacritic)

77

## Votes

In [10]:
first_numOfVotes = first_movie.find('span', class_="ipc-rating-star--voteCount").text
first_numOfVotes.strip().strip('()')  # drop surrounding whitespace, then the parentheses around the count

'827K'

## Script

In [11]:
name = []
year = []
imdb = []
metacritic = []
votes = []

for containers in movie_containers:
    if containers.find("span", class_="sc-b0901df4-0 bcQdDJ metacritic-score-box") is not None:
        # get the movie name:
        name.append(" ".join(containers.find('h3',class_ = "ipc-title__text").text[3:].split())) # removes the number
        # get the movie year
        year.append(containers.find("span", class_="sc-b0691f29-8 ilsLEX dli-title-metadata-item").text) # movie year by a string
        # get the imdb rating
        imdb.append(
            float(containers.find("span",class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating").text[:3])
        ) # all the rating are in floating numbers
        # get the metacritic score
        metacritic.append(int(containers.find("span", class_="sc-b0901df4-0 bcQdDJ metacritic-score-box").text))
        # get the votes number
        votes.append(
            containers.find('span', class_="ipc-rating-star--voteCount").text.strip().strip('()')
        ) # strips surrounding whitespace and the parentheses, without reusing the first movie's indices

    

## Dataframe

In [12]:
import pandas as pd

movie_df = pd.DataFrame(
    {"name":name,
     "year":year,
     "imdb":imdb,
     "metascore":metacritic,
     "votes":votes})

print(movie_df.info())
movie_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       41 non-null     object 
 1   year       41 non-null     object 
 2   imdb       41 non-null     float64
 3   metascore  41 non-null     int64  
 4   votes      41 non-null     object 
dtypes: float64(1), int64(1), object(3)
memory usage: 1.7+ KB
None


|   | name | year | imdb | metascore | votes |
|---|------|------|------|-----------|-------|
| 0 | Logan | 2017 | 8.1 | 77 | 827K |
| 1 | Thor: Ragnarok | 2017 | 7.9 | 74 | 813K |
| 2 | Guardians of the Galaxy Vol. 2 | 2017 | 7.6 | 67 | 756K |
| 3 | Dunkirk | 2017 | 7.8 | 94 | 736K |
| 4 | Spider-Man: Homecoming | 2017 | 7.4 | 73 | 716K |
| 5 | Wonder Woman | 2017 | 7.3 | 76 | 698K |
| 6 | Get Out | 2017 | 7.8 | 85 | 691K |
| 7 | Star Wars: Episode VIII - The Last Jedi | 2017 | 6.9 | 84 | 670K |
| 8 | Blade Runner 2049 | 2017 | 8.0 | 81 | 658K |
| 9 | Baby Driver | 2017 | 7.5 | 86 | 605K |


## Multiple pages

In [25]:
from time import time
from time import sleep
from random import randint
from IPython.display import clear_output

years_url = ['2005','2006','2007','2008','2009','2010','2011',
             '2012','2013', '2014', '2015', '2016','2017', '2018', '2019', '2020', '2021', '2022', '2023']
# Redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []


# Preparing the monitoring of the loop
start_time = time()
request_count = 0  # named request_count so it does not shadow the requests module imported earlier
agent = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
        "referer":"https://www.imdb.com/"}
# For every year in the interval 2005-2023
for year_url in years_url:
    # Make a get request
    response = get('https://www.imdb.com/search/title/?release_date=' + year_url +'-01-01,' + year_url +'-12-31&sort=num_votes,desc',
                       headers = agent)

    # Pause the loop
    sleep(randint(8,15))

    # Monitor the requests
    request_count += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(request_count, request_count/elapsed_time))
    clear_output(wait = True)

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        print('Request: {}; Status code: {}'.format(request_count, response.status_code))

    # Break the loop if the number of requests is greater than expected
    if request_count > 72:
        print('Number of requests was greater than expected.')
        break
    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')
    # Select all the 50 movie containers from a single page
    mv_containers = page_html.find_all('li', class_ = 'ipc-metadata-list-summary-item')
        
    # For every movie of these 50
    for container in mv_containers:
        # If the movie has a Metascore, then:
        if container.find("span", class_="sc-b0901df4-0 bcQdDJ metacritic-score-box") is not None:
            # get the movie name:
            names.append(" ".join(container.find('h3',class_ = "ipc-title__text").text[3:].split())) # removes the number
            # get the movie year
            years.append(container.find("span", class_="sc-b0691f29-8 ilsLEX dli-title-metadata-item").text) # movie year by a string
            # get the imdb rating
            imdb_ratings.append(
                float(container.find("span",class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating").text[:3])
            ) # all the rating are in floating numbers
            # get the metacritic score
            metascores.append(int(container.find("span", class_="sc-b0901df4-0 bcQdDJ metacritic-score-box").text))
            # get the votes number
            votes.append(
                container.find('span', class_="ipc-rating-star--voteCount").text.strip().strip('()')
            ) # strips surrounding whitespace and the parentheses, without reusing the first movie's indices


Request:19; Frequency: 0.06815909585348734 requests/s
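
The structure of the loop above (build one URL per year, fetch it, pause, report the request frequency) can be tried without touching the network. In this sketch `build_url`, `crawl`, and `fake_fetch` are names introduced for illustration; `fake_fetch` is a stand-in that returns a fixed status code instead of calling `get`, and the pause defaults to zero so the example finishes instantly.

```python
import time

def build_url(year):
    """Build the search URL for one release year, mirroring the loop above."""
    return ('https://www.imdb.com/search/title/?release_date='
            f'{year}-01-01,{year}-12-31&sort=num_votes,desc')

def crawl(years, fetch, pause=0.0):
    """Call fetch(url) once per year, pausing between calls; return results by year."""
    results = {}
    start = time.time()
    for count, year in enumerate(years, start=1):
        results[year] = fetch(build_url(year))
        time.sleep(pause)  # be polite to the server between requests
        elapsed = max(time.time() - start, 1e-9)  # guard against division by zero
        print(f"Request: {count}; Frequency: {count / elapsed:.2f} requests/s")
    return results

def fake_fetch(url):
    """Hypothetical stand-in for requests.get: pretends every page returned HTTP 200."""
    return 200

pages = crawl(['2005', '2006', '2007'], fake_fetch)
print(pages)
```

Passing the fetch function as a parameter keeps the pacing logic testable; in the notebook it would be replaced by a function that calls `get(url, headers=agent)`.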


In [26]:
names

['Batman Begins',
 'V for Vendetta',
 'Star Wars: Episode III - Revenge of the Sith',
 'Sin City',
 'Harry Potter and the Goblet of Fire',
 'Mr. & Mrs. Smith',
 'Charlie and the Chocolate Factory',
 'War of the Worlds',
 'The 40 Year Old Virgin',
 'King Kong',
 'Madagascar',
 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe',
 'Brokeback Mountain',
 'Wedding Crashers',
 'Constantine',
 'Fantastic Four',
 'Lord of War',
 'Hitch',
 'Pride & Prejudice',
 'The Island',
 'Kingdom of Heaven',
 'Serenity',
 'Corpse Bride',
 'Saw II',
 'Walk the Line',
 'A History of Violence',
 'The Descent',
 'Munich',
 'Kiss Kiss Bang Bang',
 'Thank You for Smoking',
 'Match Point',
 'Transporter 2',
 "The Hitchhiker's Guide to the Galaxy",
 'Jarhead',
 'Cinderella Man',
 'The Longest Yard',
 'Hostel',
 'Flightplan',
 'Coach Carter',
 'The Prestige',
 'The Departed',
 '300',
 "Pirates of the Caribbean: Dead Man's Chest",
 'El Laberinto Del Fauno',
 'Casino Royale',
 'Blood Diamond',
 'The Pur

In [27]:
len(names)

797

In [28]:
movie_ratings = pd.DataFrame(
    {"name":names,
     "year":years,
     "imdb":imdb_ratings,
     "metascore":metascores,
     "votes":votes})

print(movie_ratings.info())
movie_ratings.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 797 entries, 0 to 796
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       797 non-null    object 
 1   year       797 non-null    object 
 2   imdb       797 non-null    float64
 3   metascore  797 non-null    int64  
 4   votes      797 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 31.3+ KB
None


|   | name | year | imdb | metascore | votes |
|---|------|------|------|-----------|-------|
| 0 | Batman Begins | 2005 | 8.2 | 70 | 1.6M |
| 1 | V for Vendetta | 2005 | 8.2 | 62 | 1.2M |
| 2 | Star Wars: Episode III - Revenge of the Sith | 2005 | 7.6 | 68 | 842K |
| 3 | Sin City | 2005 | 8.0 | 74 | 792K |
| 4 | Harry Potter and the Goblet of Fire | 2005 | 7.7 | 81 | 678K |
| 5 | Mr. & Mrs. Smith | 2005 | 6.5 | 55 | 539K |
| 6 | Charlie and the Chocolate Factory | 2005 | 6.7 | 72 | 527K |
| 7 | War of the Worlds | 2005 | 6.5 | 73 | 474K |
| 8 | The 40 Year Old Virgin | 2005 | 7.1 | 73 | 465K |
| 9 | King Kong | 2005 | 7.2 | 81 | 445K |


In [31]:
movie_ratings.to_csv("movie_ratings.csv")

## Data Preparation

- Collected data may not be compatible or formatted correctly
- Data must be prepared before it can be added to a data set
- Extract, Transform and Load (ETL)
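
The three ETL steps can be sketched with the standard library alone. Everything here is an illustrative assumption: the CSV text stands in for collected data, and an in-memory SQLite table named `movies` stands in for the target data set.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from CSV text (a stand-in for a collected file)
raw_csv = "name,year,imdb\nLogan,2017,8.1\nDunkirk,2017,7.8\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: convert text fields to the types the target table expects
records = [(r["name"], int(r["year"]), float(r["imdb"])) for r in rows]

# Load: insert the cleaned records into a database table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER, imdb REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", records)
conn.commit()

loaded = conn.execute(
    "SELECT name, year, imdb FROM movies ORDER BY imdb DESC"
).fetchall()
print(loaded)
conn.close()
```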

## Data preprocessing

Data preprocessing is the process of cleaning raw data: data collected in the real world is converted into a clean data set. Whenever data is gathered from different sources, it arrives in a raw format that is not suitable for analysis. Certain steps are therefore carried out to convert it into a small, clean data set; this part of the process is called data preprocessing.

Most real-world data is messy. Some common problems are:
1. Missing data: data can be missing when it is not created continuously or because of technical issues in the application (for example, an IoT system).
2. Noisy data: this type of data, which includes outliers, can occur due to human error (people gathering the data manually) or a technical problem with the device at the time of collection.
3. Inconsistent data: this type of data can arise from human error (mistakes in names or values) or from duplication of data.

These are some of the basic preprocessing techniques that can be used to convert raw data:
1. Conversion of data: most machine learning models can only handle numeric features, so categorical and ordinal data must be converted into numeric features.
2. Ignoring the missing values: when we encounter missing data in the data set, we can remove the affected row or column, depending on our needs. This method is efficient, but it should not be used if there are a lot of missing values in the data set.
3. Filling the missing values: when we encounter missing data in the data set, we can fill in the missing values manually; most commonly the mean, median, or most frequent value is used.
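
The techniques listed above can be sketched in plain Python. The sensor readings and category labels below are hypothetical values chosen for illustration; `None` marks a missing value.

```python
from statistics import mean, median, mode

# Hypothetical sensor readings; None marks a missing value
readings = [20.0, None, 26.0, 22.0, None, 22.0]
known = [r for r in readings if r is not None]

# Ignoring the missing values: keep only the entries that are present
dropped = known

# Filling the missing values with the mean, median, or most frequent value
filled_mean = [mean(known) if r is None else r for r in readings]
filled_median = [median(known) if r is None else r for r in readings]
filled_mode = [mode(known) if r is None else r for r in readings]

# Conversion of data: map categorical labels to integer codes
labels = ["R", "PG-13", "R", "PG"]
codes = {label: i for i, label in enumerate(sorted(set(labels)))}
encoded = [codes[label] for label in labels]

print(filled_mean)
print(filled_median)
print(encoded)
```

Integer codes imply an ordering between categories, which suits ordinal data; for purely categorical data, one-hot encoding is a common alternative.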

In [32]:
movie_ratings['year'].unique()

array(['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
       '2021', '2022', '2023'], dtype=object)

In [33]:
movie_ratings.dtypes

name          object
year          object
imdb         float64
metascore      int64
votes         object
dtype: object
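
The dtypes show that `year` and `votes` are still stored as text (`object`). One possible conversion step is sketched below in plain Python; `parse_votes` is a helper name introduced here, and the sample values are taken from the tables in this notebook. Because the scraped counts are abbreviated (such as `1.6M`), the converted numbers are approximations.

```python
def parse_votes(text):
    """Convert an abbreviated count such as '827K' or '1.6M' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)  # no suffix: the value is already a plain number

votes_text = ["1.6M", "842K", "445K"]
years_text = ["2005", "2005", "2023"]

votes_numeric = [parse_votes(v) for v in votes_text]
years_numeric = [int(y) for y in years_text]

print(votes_numeric)
print(years_numeric)
```

With pandas, the same helper could be applied to a column with `movie_ratings['votes'].map(parse_votes)`, and the year column converted with `astype(int)`.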

In [36]:
movie_ratings.head(10)

|   | name | year | imdb | metascore | votes |
|---|------|------|------|-----------|-------|
| 0 | Batman Begins | 2005 | 8.2 | 70 | 1.6M |
| 1 | V for Vendetta | 2005 | 8.2 | 62 | 1.2M |
| 2 | Star Wars: Episode III - Revenge of the Sith | 2005 | 7.6 | 68 | 842K |
| 3 | Sin City | 2005 | 8.0 | 74 | 792K |
| 4 | Harry Potter and the Goblet of Fire | 2005 | 7.7 | 81 | 678K |
| 5 | Mr. & Mrs. Smith | 2005 | 6.5 | 55 | 539K |
| 6 | Charlie and the Chocolate Factory | 2005 | 6.7 | 72 | 527K |
| 7 | War of the Worlds | 2005 | 6.5 | 73 | 474K |
| 8 | The 40 Year Old Virgin | 2005 | 7.1 | 73 | 465K |
| 9 | King Kong | 2005 | 7.2 | 81 | 445K |


In [37]:
movie_ratings.tail(10)

|   | name | year | imdb | metascore | votes |
|---|------|------|------|-----------|-------|
| 787 | La sociedad de la nieve | 2023 | 7.8 | 72 | 122K |
| 788 | The Marvels | 2023 | 5.6 | 50 | 119K |
| 789 | Scream VI | 2023 | 6.5 | 61 | 118K |
| 790 | Fast X | 2023 | 5.8 | 56 | 117K |
| 791 | Knock at the Cabin | 2023 | 6.1 | 63 | 114K |
| 792 | Sound of Freedom | 2023 | 7.7 | 36 | 111K |
| 793 | Asteroid City | 2023 | 6.5 | 75 | 110K |
| 794 | A Haunting in Venice | 2023 | 6.5 | 63 | 109K |
| 795 | The Hunger Games: The Ballad of Songbirds & Sn... | 2023 | 6.8 | 54 | 108K |
| 796 | The Equalizer 3 | 2023 | 6.8 | 58 | 107K |


## END