# Data Source

![Alt text](\Images\Image_1.png)


# Introduction to Data Retrieval

In the information age, data has become an invaluable resource for analysis, decision-making, and the development of innovative applications. The ability to collect, analyze, and derive insights from data is an essential skill in fields as varied as data science, software development, and digital marketing.

Data retrieval is the critical first step in any data analysis project or data-driven application. It involves gathering information from sources as diverse as a company's internal databases, local files, public APIs, or web pages whose content is extracted with web scraping techniques.

This process involves not only collecting the data but also inspecting and understanding it. Before analyzing the data or using it to train machine learning models, it is essential to ensure that it is accessible, relevant, and of sufficient quality.
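As a minimal sketch of that first inspection, the snippet below counts rows and missing values per column in a small CSV sample. The sample data and the `inspect_rows` helper are made up for illustration; a library such as pandas offers comparable checks through `DataFrame.info()`.

```python
import csv
import io

# Hypothetical sample; real data would come from a file, an API, or a scraper.
SAMPLE_CSV = """city,temp_c,humidity
Madrid,6.6,75
Lima,,82
Bogota,14.1,
"""


def inspect_rows(csv_text):
    """Return the column names, the row count, and the missing-value count per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # An empty string means the field was left blank in the CSV
    missing = {col: sum(1 for row in rows if not row[col]) for col in columns}
    return {"columns": columns, "n_rows": len(rows), "missing": missing}


report = inspect_rows(SAMPLE_CSV)
print(report)
```

A report like this makes it easy to spot columns that need cleaning before any analysis starts.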

In the following sections, we will explore various techniques and tools for data retrieval, from consuming APIs to web scraping and loading local files. Our goal is to provide you with a set of practical skills that allow you to gather data from a wide range of sources and prepare it for further analysis or development.

## Data Sources
- **Public APIs**: We will use the OpenWeatherMap API as an example to obtain weather data.
- **Web Scraping**: We will extract articles from a news website (larepublica.pe) to demonstrate how data collection from websites can be automated.
- **Local Files**: We will load and analyze data from CSV and Excel files, a common practice in data analysis.


### Public APIs 


https://openweathermap.org/api


![Alt text](Images\Image_2.png)

In [2]:
import requests

# Replace {YOUR-APIKEY} with your own OpenWeatherMap API key
api_url = "http://api.openweathermap.org/data/2.5/weather?q=Madrid,es&appid={YOUR-APIKEY}"

# Send the request to the API
response = requests.get(api_url)
# Check whether the request succeeded
if response.status_code == 200:
    data = response.json()
    print("Data retrieved from OpenWeatherMap:")
    print(data)
else:
    print("Error retrieving data from the API.")

Data retrieved from OpenWeatherMap:
{'coord': {'lon': -3.7026, 'lat': 40.4165}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01n'}], 'base': 'stations', 'main': {'temp': 279.77, 'feels_like': 276.23, 'temp_min': 278.15, 'temp_max': 280.58, 'pressure': 1002, 'humidity': 75}, 'visibility': 10000, 'wind': {'speed': 5.66, 'deg': 230}, 'clouds': {'all': 0}, 'dt': 1711914239, 'sys': {'type': 2, 'id': 2007545, 'country': 'ES', 'sunrise': 1711864773, 'sunset': 1711910273}, 'timezone': 7200, 'id': 3117735, 'name': 'Madrid', 'cod': 200}
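The response above reports temperatures in Kelvin, which is the API default. The sketch below shows one way to build the request URL from query parameters and to reduce the payload to a few readable fields. The helper names `build_weather_url` and `summarize_weather` are illustrative, `YOUR-APIKEY` remains a placeholder, and `sample` is a trimmed copy of the output shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units=None):
    """Build the request URL; units='metric' asks the API for Celsius."""
    params = {"q": city, "appid": api_key}
    if units:
        params["units"] = units
    return f"{BASE_URL}?{urlencode(params)}"


def summarize_weather(payload):
    """Pick a few fields from a payload that uses the default units (Kelvin)."""
    main = payload["main"]
    return {
        "city": payload["name"],
        "description": payload["weather"][0]["description"],
        "temp_c": round(main["temp"] - 273.15, 2),  # Kelvin to Celsius
        "humidity_pct": main["humidity"],
    }


# Trimmed copy of the response printed above
sample = {
    "weather": [{"id": 800, "main": "Clear", "description": "clear sky"}],
    "main": {"temp": 279.77, "humidity": 75},
    "name": "Madrid",
}

url = build_weather_url("Madrid,es", "YOUR-APIKEY", units="metric")
summary = summarize_weather(sample)
print(url)
print(summary)
```

Passing the parameters as a dictionary keeps the URL readable and lets `urlencode` handle the escaping of characters such as the comma in `Madrid,es`.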


### Web Scraping

![Alt text](\Images\Image_3.jpg)

### Web Scraping Tools

In [3]:
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd

# Browser-like User-Agent shared by every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}


def get_article_info(url):
    response = requests.get(url, headers=headers)
    page_soup = soup(response.content, 'html.parser')
    author_block = page_soup.find('div', class_='AboutAuthor_aboutAuthor__content__EoaFN')
    if author_block:
        author_block.decompose()  # Remove the author block from the soup

    title = page_soup.find('h1')  # Assumes the title is in <h1>

    # Extract the category from the <a> inside the category <span>
    category_span = page_soup.find('span', class_='TitleSection_titleSection__main__UjavR')
    category_link = category_span.find('a') if category_span else None
    category = category_link.text.strip() if category_link else 'No Category Found'

    # Extract the article date
    date_element = page_soup.find('time', class_='TitleSection_main__date__L_7Cf')
    date = date_element.text.strip() if date_element else 'No Date Found'

    # Collect every <h2> and <p> element for the body
    body = []
    pip_elements = page_soup.find('main', class_='col__content')
    if pip_elements is not None:
        body_elements = pip_elements.find_all(['h2', 'p'])
        for element in body_elements:
            # Skip text that belongs to a quoted block
            if element.find_parent('blockquote'):
                continue
            body.append(element.text.strip())
    else:
        print(f"Element 'main' with class 'col__content' not found in {url}")

    return {
        'url': url,
        'title': title.text.strip() if title else 'No Title Found',
        'category': category,
        'date': date,
        'body': ' '.join(body)
    }


def scrape_category(category_url):
    response = requests.get(category_url, headers=headers)
    soup_html = soup(response.content, 'html.parser')

    articles = []
    for link in soup_html.find_all('a', href=True):
        article_url = link['href']

        # Make sure the URL is complete and valid
        if article_url.startswith('/'):
            article_url = f'https://larepublica.pe{article_url}'
        elif not article_url.startswith('http'):
            continue  # Skip URLs that are not valid

        info = get_article_info(article_url)
        articles.append(info)

    return pd.DataFrame(articles)


# List of category URLs
categories = [
    'https://larepublica.pe/sociedad/',
    'https://larepublica.pe/economia/',
    'https://larepublica.pe/politica/',
    'https://larepublica.pe/deportes/',
    'https://larepublica.pe/cine-series/',
    'https://larepublica.pe/mundo/',
    'https://larepublica.pe/ciencia/',
    'https://larepublica.pe/tendencias/',
]

# Dictionary that stores one DataFrame per category
category_dfs = {}

for category in categories:
    category_dfs[category] = scrape_category(category)

Element 'main' with class 'col__content' not found in https://larepublica.pe/ultimas-noticias
Element 'main' with class 'col__content' not found in https://larepublica.pe/
Element 'main' with class 'col__content' not found in https://larepublica.pe/ultimas-noticias
Element 'main' with class 'col__content' not found in https://larepublica.pe/politica
Element 'main' with class 'col__content' not found in https://larepublica.pe/economia
Element 'main' with class 'col__content' not found in https://larepublica.pe/sociedad
Element 'main' with class 'col__content' not found in https://larepublica.pe/mundo
Element 'main' with class 'col__content' not found in https://larepublica.pe/peru
Element 'main' with class 'col__content' not found in https://larepublica.pe/deportes
Element 'main' with class 'col__content' not found in https://larepublica.pe/espectaculos
Element 'main' with class 'col__content' not found in https://larepublica.pe/cine-series
...

In [4]:
df_all = pd.concat(category_dfs.values(), ignore_index=True)
df_all

| | url | title | category | date | body |
|---|---|---|---|---|---|
| 0 | https://larepublica.pe/ultimas-noticias | NOTICIAS DE ÚLTIMO MINUTO | No Category Found | No Date Found | |
| 1 | https://larepublica.pe/politica/2024/03/31/din... | Se complica la situación procesal de Dina Bolu... | Política | 31 Mar 2024 \| 8:10 h | Patrimonio sospechoso. El allanamiento al domi... |
| 2 | https://larepublica.pe/politica/2024/03/29/din... | Dina Boluarte EN VIVO: últimas noticias tras a... | Política | 31 Mar 2024 \| 12:36 h | Mediante un operativo, efectivos policiales al... |
| 3 | https://larepublica.pe/sociedad/2024/03/27/hor... | CENTROS COMERCIALES en Semana Santa: ¿Atenderá... | Sociedad | 29 Mar 2024 \| 19:55 h | Si vas a descansar Jueves Santo o Viernes Sant... |
| 4 | https://larepublica.pe/economia/2024/03/27/ret... | Retiro CTS 2024 en Perú: ¿cuándo se deposita y... | Economía | 31 Mar 2024 \| 14:43 h | En menos de 2 meses, las empresas deben realiz... |
| ... | ... | ... | ... | ... | ... |
| 1233 | https://larepublica.pe/verificador | NOTICIAS | No Category Found | No Date Found | |
| 1234 | https://perulegal.larepublica.pe | ¿Quiénes son los accesitarios de la JNJ y cómo... | No Category Found | No Date Found | |
| 1235 | https://lrmas.larepublica.pe | No Title Found | No Category Found | No Date Found | |
| 1236 | https://perubazar.pe/ | No Title Found | No Category Found | No Date Found | |


In [5]:
df_all.value_counts('category')

category
No Category Found    864
Deportes              67
Economía              57
Mundo                 52
Política              48
Sociedad              40
Ciencia               35
Cine y series         33
Tendencias            31
Horóscopo              8
Espectáculos           1
Tecnología             1
Video viral            1
dtype: int64

In [6]:
df_all = df_all[df_all['category'] != 'No Category Found']
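Many of the rows without a category appear to come from navigation and external links (for example `https://larepublica.pe/ultimas-noticias` or `https://perubazar.pe/`) rather than from articles. One option is to filter the links before requesting them. The sketch below assumes that article paths follow the `/<section>/<yyyy>/<mm>/<dd>/<slug>` pattern visible in the table above; the helper name `select_article_urls` and the example slugs are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

BASE_URL = "https://larepublica.pe"
# Assumed article pattern: /<section>/<yyyy>/<mm>/<dd>/<slug>
ARTICLE_PATH = re.compile(r"^/[^/]+/\d{4}/\d{2}/\d{2}/.+")


def select_article_urls(hrefs, base_url=BASE_URL):
    """Resolve relative links, keep same-site article URLs, and drop duplicates."""
    base_host = urlparse(base_url).netloc
    seen = set()
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # External site
        if not ARTICLE_PATH.match(parsed.path):
            continue  # Section page or navigation link
        if absolute not in seen:
            seen.add(absolute)
            selected.append(absolute)
    return selected


# Hypothetical hrefs, similar in shape to those found on a category page
hrefs = [
    "/ultimas-noticias",
    "/politica/2024/03/31/example-article",
    "https://larepublica.pe/politica/2024/03/31/example-article",
    "https://perubazar.pe/",
    "/economia/2024/03/27/another-example",
]
article_urls = select_article_urls(hrefs)
print(article_urls)
```

Filtering up front avoids hundreds of unnecessary requests and keeps the same article from being downloaded once per category page.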

In [7]:
import plotly.graph_objects as go
import pandas as pd

# Compute the text length (in words) of every article, grouped by category
text_lens = {label: [len(t.split()) for t in df_all.loc[df_all['category'] == label, 'body']]
             for label in df_all['category'].unique()}

# Create a horizontal boxplot
boxplot_horizontal = go.Figure()

# Add one trace per category
for label, lens in text_lens.items():
    boxplot_horizontal.add_trace(go.Box(x=lens, name=label, orientation='h'))

# Update the layout of the horizontal boxplot
boxplot_horizontal.update_layout(height=400, width=800, title_text="Horizontal Boxplot: Distribution of Text Lengths by Category")
boxplot_horizontal.update_xaxes(title_text="Text Length")

# Show the figure
boxplot_horizontal.show()

# Create a vertical boxplot
boxplot = go.Figure()
for label, lens in text_lens.items():
    boxplot.add_trace(go.Box(y=lens, name=label))

# Update the layout of the vertical boxplot
boxplot.update_layout(height=400, width=800, title_text="Boxplot: Distribution of Text Lengths by Category")
boxplot.update_yaxes(title_text="Text Length")
boxplot.show()

# Create a histogram
histogram = go.Figure()
for label, lens in text_lens.items():
    histogram.add_trace(go.Histogram(x=lens, name=label, opacity=0.6))

# Update the layout of the histogram
histogram.update_layout(barmode='overlay', height=400, width=800, title_text="Histogram: Text Length Frequency by Category")
histogram.update_xaxes(title_text="Text Length")
histogram.update_yaxes(title_text="Frequency")
histogram.show()

### Local Files

In [8]:
import pandas as pd

df = pd.read_excel('Data/Puente Piedra con números.xlsx')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Nombres y Apellidos  197 non-null    object
 1   Paradero (Mañana)    197 non-null    object
 2   Paradero(Noche)      197 non-null    object
 3   Código               198 non-null    object
 4   Celular              198 non-null    object
 5   Facultad             198 non-null    object
 6   Carrera              198 non-null    object
 7   Ciclo Matriculado    198 non-null    int64 
 8   Sexo                 198 non-null    object
dtypes: int64(1), object(8)
memory usage: 14.0+ KB


In [9]:
from unidecode import unidecode
import numpy as np

# Apply the same cleaning steps to both stop columns
for col in ['Paradero (Mañana)', 'Paradero(Noche)']:
    # Cast to str (NaN becomes 'nan'), replace non-breaking spaces, strip accents, trim, uppercase
    df[col] = df[col].apply(lambda x: unidecode(str(x).replace('\xa0', ' ')).strip().upper())
    # Treat placeholder dashes and stringified NaN as missing values
    df[col] = df[col].replace(['-', 'NAN'], np.nan)

# Merge spelling variants of the same night stop
df['Paradero(Noche)'] = df['Paradero(Noche)'].replace('ROSALUZ', 'ROSA LUZ')
df['Paradero(Noche)'] = df['Paradero(Noche)'].replace('ZAPALLAL/SAN PEDRO', 'OVALO ZAPALLAL/SAN PEDRO')
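The cleaning rules above can also be checked on plain strings before they are applied to the DataFrame. The sketch below is a dependency-free variant: it swaps `unidecode` for the standard library's `unicodedata` (an assumption for illustration, not the method used in the notebook) and, unlike the cell above, also collapses repeated internal spaces. The function name `normalize_stop_name` and the sample values are hypothetical.

```python
import unicodedata


def normalize_stop_name(value):
    """Return an uppercase, accent-free stop name, or None for blank entries."""
    if value is None:
        return None
    # NFKD splits accented letters into base letter + combining mark
    # and maps the non-breaking space to a regular space
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Trim, collapse repeated whitespace, and uppercase
    text = " ".join(text.split()).upper()
    return None if text in {"", "-", "NAN"} else text


raw = ["Fundición\xa0", " norteño", "-", None, "Rosa  Luz"]
cleaned = [normalize_stop_name(v) for v in raw]
print(cleaned)
# ['FUNDICION', 'NORTENO', None, None, 'ROSA LUZ']
```

With pandas, a function like this could be applied through `Series.map`, which keeps the cleaning logic in one place for both stop columns.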

In [10]:
name_whereabouts_morning = df['Paradero (Mañana)'].unique()
name_whereabouts_ninght = df['Paradero(Noche)'].unique()

print(name_whereabouts_morning , name_whereabouts_ninght)

['FUNDICION' 'OVALO ZAPALLAL/SAN PEDRO' 'ESTABLO' 'PRO' 'NORTENO'
 'HOSPITAL' 'OVALO ZAPALLAL' 'SANTA ROSA' 'VILLA ESTELA' 'SHANGRI-LA'
 'HOGAR' 'ROSA LUZ' 'VILLASOL' nan 'CASETA' 'TRES RUEDAS' 'SANTA LUISA'
 'TRES POSTES/CASETA'] ['FUNDICION' 'OVALO ZAPALLAL/SAN PEDRO' 'ESTABLO' 'PRO' 'NORTENO' 'TOTTUS'
 'OVALO ZAPALLAL' 'SANTA ROSA' 'VILLA ESTELA' 'CRUCE' 'SHANGRI-LA' 'HOGAR'
 'ROSA LUZ' 'HOSPITAL' 'VILLASOL' nan 'CASETA' 'TOTTUS/SAN PEDRO'
 'TRES RUEDAS' 'SANTA LUISA' 'TRES POSTES/CASETA' 'ARICA']


In [12]:
whereabouts_uniques = pd.unique(np.concatenate((name_whereabouts_morning, name_whereabouts_ninght)))

df_stops = pd.DataFrame(whereabouts_uniques, columns=['Stop'])

# Remove NaN values
df_stops.dropna(inplace=True)

# Reset the index to start at 1 and use it as Stop_ID
df_stops.reset_index(drop=True, inplace=True)
df_stops.index += 1
df_stops['Stop_ID'] = df_stops.index

In [15]:
df_stops['Stop']

1                    FUNDICION
2     OVALO ZAPALLAL/SAN PEDRO
3                      ESTABLO
4                          PRO
5                      NORTENO
6                     HOSPITAL
7               OVALO ZAPALLAL
8                   SANTA ROSA
9                 VILLA ESTELA
10                  SHANGRI-LA
11                       HOGAR
12                    ROSA LUZ
13                    VILLASOL
14                      CASETA
15                 TRES RUEDAS
16                 SANTA LUISA
17          TRES POSTES/CASETA
18                      TOTTUS
19                       CRUCE
20            TOTTUS/SAN PEDRO
21                       ARICA
Name: Stop, dtype: object

In [16]:
import uuid

df['user_id'] = [uuid.uuid4() for _ in range(len(df))]
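`uuid.uuid4()` produces random identifiers, so every run of the notebook assigns different IDs to the same people. If stable identifiers are preferable, one option is to derive them from an existing unique field with `uuid.uuid5`. The sketch below assumes the student code (`Código`) is unique per person; the namespace string, the helper name `stable_user_id`, and the sample codes are hypothetical.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "transport-survey.example")


def stable_user_id(student_code):
    """Derive a deterministic UUID string from a student code."""
    return str(uuid.uuid5(NAMESPACE, str(student_code).strip()))


# Made-up codes; the first and last are the same student
codes = ["20201234", "20205678", "20201234"]
ids = [stable_user_id(code) for code in codes]
print(ids[0] == ids[2], ids[0] != ids[1])
```

Returning strings also keeps the column easy to write to CSV. Note that a deterministic ID is a pseudonym rather than an anonymous value: anyone who knows the namespace and a student code can recompute it.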

In [17]:
df_demand_morning = df.groupby('Paradero (Mañana)').size().sort_values(ascending=False)
df_demand_night = df.groupby('Paradero(Noche)').size().sort_values(ascending=False)

In [18]:
import plotly.express as px

fig = px.bar(df_demand_morning, title='Number of People per Stop', text='value')
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(yaxis=dict(range=[0, df_demand_morning.values.max() * 1.2]),
                  xaxis_title="Stop",
                  yaxis_title="Number of People",
                  xaxis={'categoryorder':'total descending'})


In [28]:
df_locations = df.loc[:,['user_id', 'Paradero (Mañana)','Paradero(Noche)']]
df_academic_info = df.loc[:, ['user_id','Facultad','Carrera']]
df_user_info = df.loc[:, ['user_id', 'Nombres y Apellidos', 'Código', 'Sexo', 'Ciclo Matriculado']]

map_stop = pd.Series(df_stops.Stop_ID.values, index=df_stops.Stop).to_dict()
df_locations['Paradero (Mañana)'] = df_locations['Paradero (Mañana)'].map(map_stop)
df_locations['Paradero(Noche)'] = df_locations['Paradero(Noche)'].map(map_stop)
# Provided coordinates
coordinates = [
    (-11.833521765869813, -77.1134069578465),
    (-11.843023105367193, -77.10170159033105),
    (-11.862169131562387, -77.0778637172763),
    (-11.93271774760639, -77.07232194267901),
    (-11.856781489866353, -77.08444414595633),
    (-11.858022651302019, -77.0829460980413),
    (-11.843141089927007, -77.1014620573036),
    (-11.79866362669378, -77.14047122579126),
    (-11.812206448958586, -77.12840638030114),
    (-11.913825733595084, -77.0736633825659),
    (-11.821866440504943, -77.12247949955984),
    (-11.88016818411576, -77.06887124552881),
    (-11.95917691167808, -77.0687340108642),
    (-11.98216243528904, -77.06528063398528),
    (-11.900875591910829, -77.06768109979721),
    (-11.95304806897095, -77.06965211160734),
    (-11.966448565987971, -77.06757828468267),
    (-11.867675146640673, -77.07235760364975),
    (-11.836445218632935, -77.11000424380381),
    (-11.864527560196388, -77.0744324096468),
    (-11.853213710610527, -77.0888838179063)
]

# Splitting the coordinates into latitude and longitude
latitudes = [coord[0] for coord in coordinates]
longitudes = [coord[1] for coord in coordinates]

# Adding the latitudes and longitudes to the DataFrame
df_stops['Latitude'] = latitudes
df_stops['Longitude'] = longitudes

df_stops


| | Stop | Stop_ID | Latitud | Longitud | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 1 | FUNDICION | 1 | -11.833522 | -77.113407 | -11.833522 | -77.113407 |
| 2 | OVALO ZAPALLAL/SAN PEDRO | 2 | -11.843023 | -77.101702 | -11.843023 | -77.101702 |
| 3 | ESTABLO | 3 | -11.862169 | -77.077864 | -11.862169 | -77.077864 |
| 4 | PRO | 4 | -11.932718 | -77.072322 | -11.932718 | -77.072322 |
| 5 | NORTENO | 5 | -11.856781 | -77.084444 | -11.856781 | -77.084444 |
| 6 | HOSPITAL | 6 | -11.858023 | -77.082946 | -11.858023 | -77.082946 |
| 7 | OVALO ZAPALLAL | 7 | -11.843141 | -77.101462 | -11.843141 | -77.101462 |
| 8 | SANTA ROSA | 8 | -11.798664 | -77.140471 | -11.798664 | -77.140471 |
| 9 | VILLA ESTELA | 9 | -11.812206 | -77.128406 | -11.812206 | -77.128406 |
| 10 | SHANGRI-LA | 10 | -11.913826 | -77.073663 | -11.913826 | -77.073663 |
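Once each stop has coordinates, distances between stops can be estimated. The sketch below uses the haversine formula, which treats the Earth as a sphere with a mean radius of 6371 km, so the result is a straight-line approximation and not a driving distance. The function name `haversine_km` is illustrative; the two coordinate pairs are the first two entries of the list above (FUNDICION and OVALO ZAPALLAL/SAN PEDRO).

```python
import math

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


fundicion = (-11.833521765869813, -77.1134069578465)
ovalo_zapallal_san_pedro = (-11.843023105367193, -77.10170159033105)

distance = haversine_km(*fundicion, *ovalo_zapallal_san_pedro)
print(f"{distance:.2f} km")
```

Pairwise distances like this can help spot stops that are close enough to be merged, or coordinates that were assigned to the wrong stop.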


In [24]:
df_locations.to_csv('Data/locations.csv', index=False)
df_stops.to_csv('Data/stops.csv', index = False)
df_academic_info.to_csv('Data/academic_info.csv', index=False)
df_user_info.to_csv('Data/info_user.csv', index=False)

## Common Data Resources

- [Google Dataset Search](https://datasetsearch.research.google.com/)  
  A Google tool that allows searching for datasets published on various online sources. Ideal for researchers and data scientists looking for specific data.

- [Kaggle](https://www.kaggle.com/)  
  A popular platform for data science competitions that also hosts a wide variety of public datasets and notebooks.

- [Data.gov](https://data.gov/)  
  The open data portal of the United States government, offering data on a wide variety of topics, from agriculture to science and technology.

- [Datos.gob.ar](https://datos.gob.ar/)  
  The open data portal of the Argentine government, with datasets on various areas of public interest in Argentina.

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/)  
  A repository of datasets commonly used in machine learning projects and education.

- [NASA Earthdata](https://www.earthdata.nasa.gov/)  
  Access to NASA's vast array of Earth observation datasets, useful for climate and geospatial research.

- [Gapminder](https://www.gapminder.org/)  
  A non-profit foundation that promotes sustainable global development by increasing the use and understanding of statistics and other data.

- [CERN Open Data Portal](https://opendata.cern.ch/)  
  A portal that offers access to CERN's research data, allowing external researchers to analyze information collected in experiments.

- [Ultralytics Explorer](https://docs.ultralytics.com/datasets/#new-ultralytics-explorer)  
  A resource for high-quality image datasets for object detection, provided by Ultralytics.

## Climate Data

- [Air Quality Index Project](https://aqicn.org/city/beijing/)  
  Provides real-time air quality data, including PM2.5 and other pollutants. The link opens the Beijing page, but the project covers cities worldwide.

- [AQICN JSON API](https://aqicn.org/json-api/doc/)  
  API documentation for programmatic access to the AQICN project's air quality data.

## Data for Agriculture

- [Our World in Data](https://ourworldindata.org/)  
  A platform that offers data and graphs on the major global issues, including sections on agricultural production and sustainability.

- [Ag-Analytics](https://ag-analytics.portal.azure-api.net/)  
  Offers APIs for agricultural data analysis, helping farmers and businesses make data-driven decisions.

## Data for Materials

- [Pymatgen](https://pymatgen.org/)  
  A Python library for materials analysis, which allows access to materials databases like the Materials Project.

- [Organic Materials Database (OMDB)](https://omdb.mathub.io/)  
  An open database for materials science research, focused on the electronic properties of organic materials.

## Data for Biology

- [IntAct Molecular Interaction Database](https://www.ebi.ac.uk/intact/home)  
  An open database of molecular interactions, maintained by the European Bioinformatics Institute.

- [Host-Pathogen Interaction Database (HPIDB)](https://hpidb.igbb.msstate.edu/index.html)  
  An integrated database of host-pathogen interactions, useful for research in biology and infectious diseases.

- [Eye Gaze Dataset](https://github.com/cxr-eye-gaze/eye-gaze-dataset?tab=readme-ov-file)  
  An eye-tracking dataset recorded while a radiologist read chest X-rays, available on GitHub.

- [PhysioNet](https://physionet.org/)  
  A freely accessible repository for storing and sharing physiological and clinical research data, such as recorded biomedical signals.

## Self-driving

- [A2D2](https://www.a2d2.audi/a2d2/en.html): Audi Autonomous Driving Dataset, data for the development of autonomous driving technologies.
- [AV Datasets on GitHub](https://github.com/klintan/av-datasets?tab=readme-ov-file): A curated list of autonomous vehicle datasets available on GitHub.
- [Waymo Open Dataset](https://waymo.com/open/): High-quality datasets for autonomous driving research, offered by Waymo.

## Astronomy

- [SDSS](https://www.sdss.org/): Data from the Sloan Digital Sky Survey, which has created the most detailed maps of the Universe.
- [AstroML Resources](https://www.astroml.org/user_guide/resources.html): Resources and datasets for machine learning in astronomy, offered by AstroML.
