# Open Data

__Open Data__ is _data made freely available_ to the public for use, reuse, and redistribution, subject at most to light requirements such as attribution. It is typically provided in a _machine-readable format_ and can be accessed and used by anyone for purposes such as research, analysis, and innovation. __Open Data__ promotes transparency, accountability, and collaboration in both the public and private sectors.

### Download and load CSV files

To load a _CSV_ file into a DataFrame, call `pd.read_csv(url)` from the pandas library, passing the _URL_ of the file you want to load. pandas downloads the file and parses it in one step; the cell below stores the result in the `df` DataFrame.

In [3]:
import pandas as pd
import numpy as np
# https://catalog.data.gov/dataset/electric-vehicle-population-data

# URL of the CSV file
url_csv = 'https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD'
# Download the CSV file and load it into a DataFrame
df = pd.read_csv(url_csv)
print(df.shape)

# Rename columns
df = df.rename(columns={'Make':'Brand'})

# Fill missing county/city values, then drop rows that still contain nulls
df['County'] = df['County'].fillna('Unknown')
df['City'] = df['City'].fillna('Unknown')
df = df.dropna()

#Convert to categorical features
df['County'] = df['County'].astype('category')
df['City'] = df['City'].astype('category')
df['Brand'] = df['Brand'].astype('category')
df['Model'] = df['Model'].astype('category')
df['Electric Vehicle Type'] = df['Electric Vehicle Type'].astype('category')

#Change Datatypes
df['Model Year'] = df['Model Year'].astype('int16')
df['Legislative District'] = df['Legislative District'].astype('int8')
df['2020 Census Tract'] = df['2020 Census Tract'].astype('int64')

# Count distinct VIN prefixes and distinct vehicle IDs
print(df['VIN (1-10)'].nunique())
print(df['DOL Vehicle ID'].nunique())

print(df.info())
df.head()

(191407, 17)
11480
190989
<class 'pandas.core.frame.DataFrame'>
Index: 190989 entries, 0 to 191406
Data columns (total 17 columns):
 #   Column                                             Non-Null Count   Dtype   
---  ------                                             --------------   -----   
 0   VIN (1-10)                                         190989 non-null  object  
 1   County                                             190989 non-null  category
 2   City                                               190989 non-null  category
 3   State                                              190989 non-null  object  
 4   Postal Code                                        190989 non-null  float64 
 5   Model Year                                         190989 non-null  int16   
 6   Brand                                              190989 non-null  category
 7   Model                                              190989 non-null  category
 8   Electric Vehicle Type                      

Unnamed: 0,VIN (1-10),County,City,State,Postal Code,Model Year,Brand,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,5YJSA1E22K,King,Seattle,WA,98112.0,2019,TESLA,MODEL S,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,270,0,43,202233958,POINT (-122.300312 47.629782),CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),53033006500
1,3MW39FS05R,Yakima,Zillah,WA,98953.0,2024,BMW,330E,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,20,0,15,264425178,POINT (-120.2658133 46.4063477),PACIFICORP,53077002201
2,1N4AZ0CP0F,King,Kent,WA,98031.0,2015,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,84,0,11,114962025,POINT (-122.201564 47.402358),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033029306
3,5YJSA1H20F,Snohomish,Bothell,WA,98012.0,2015,TESLA,MODEL S,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,208,0,1,232724670,POINT (-122.206146 47.839957),PUGET SOUND ENERGY INC,53061052107
4,JTMAB3FV1N,Yakima,Yakima,WA,98908.0,2022,TOYOTA,RAV4 PRIME,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,42,0,14,221023589,POINT (-120.611068 46.596645),PACIFICORP,53077000902


In [4]:
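# Check which states appear, then count the electric cars registered in each city
print(np.unique(df['State']))
# 'City' is categorical, so observed= is passed explicitly to avoid a pandas FutureWarning
car_by_city_df = df.groupby('City', observed=True).agg({'VIN (1-10)': 'count'}).reset_index()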
car_by_city_df = car_by_city_df.rename(columns={'VIN (1-10)' : 'Count'})
car_by_city_df = car_by_city_df.set_index('City')
car_by_city_df

['WA']




Unnamed: 0_level_0,Count
City,Unnamed: 1_level_1
Aberdeen,175
Acme,10
Addy,2
Adna,1
Airway Heights,31
...,...
Yacolt,59
Yakima,755
Yarrow Point,151
Yelm,317


In [5]:
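# What is the average electric range per brand and model?
# Rows with a range of 0 are excluded before averaging
filtered_df = df[df['Electric Range'].notna() & (df['Electric Range'] > 0)]
print(filtered_df.shape)
# With observed=False every Brand x Model category combination is returned,
# so pairs that never occur in the data show up with a NaN mean
range_df = filtered_df.groupby(['Brand', 'Model'], observed=False).agg({'Electric Range':'mean'}).reset_index()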
range_df = range_df.sort_values(by='Electric Range', ascending=False)
range_df

(89036, 17)


Unnamed: 0,Brand,Model,Electric Range
5377,TESLA,MODEL Y,291.000000
2430,HYUNDAI,KONA,258.000000
1048,CHEVROLET,BOLT EV,244.853071
5376,TESLA,MODEL X,240.325678
5374,TESLA,MODEL 3,238.508762
...,...,...,...
6169,WHEEGO ELECTRIC CARS,XC40,
6170,WHEEGO ELECTRIC CARS,XC60,
6171,WHEEGO ELECTRIC CARS,XC90,
6172,WHEEGO ELECTRIC CARS,XM,


### Get a JSON from a URL and load

To load _JSON_ data into a DataFrame, you can pass a _URL_ straight to `pd.read_json(url)`, or download the payload yourself and hand the parsed result to `pd.DataFrame`. The cell below takes the second route: it fetches the _JSON_ with the __requests__ library, which returns a list of records, and builds the `df` DataFrame from that list.

In [6]:
# https://jsonplaceholder.typicode.com/
import requests
# URL of the JSON file
url_json = 'https://jsonplaceholder.typicode.com/comments'
# Download the JSON file
response = requests.get(url_json)
data = response.json()
print(type(data))
# Load the JSON data into a DataFrame
df = pd.DataFrame(data)
print(df.shape)
df.head()

<class 'list'>
(500, 5)


Unnamed: 0,postId,id,name,email,body
0,1,1,id labore ex et quam laborum,Eliseo@gardner.biz,laudantium enim quasi est quidem magnam volupt...
1,1,2,quo vero reiciendis velit similique earum,Jayne_Kuhic@sydney.com,est natus enim nihil est dolore omnis voluptat...
2,1,3,odio adipisci rerum aut animi,Nikita@garfield.biz,quia molestiae reprehenderit quasi aspernatur\...
3,1,4,alias odio sit,Lew@alysha.tv,non et atque\noccaecati deserunt quas accusant...
4,1,5,vero eaque aliquid doloribus et culpa,Hayden@althea.biz,harum non quasi et ratione\ntempore iure ex vo...
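
For comparison, here is a minimal sketch of the `pd.read_json` route mentioned above. To stay runnable offline it reads a small, made-up payload (`sample_json`, an assumption shaped like the `/comments` response) through `StringIO`; in practice a URL or file path could be passed instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical payload shaped like the /comments response (a JSON array of objects)
sample_json = '''[
    {"postId": 1, "id": 1, "name": "first", "email": "a@example.com", "body": "hello"},
    {"postId": 1, "id": 2, "name": "second", "email": "b@example.com", "body": "world"}
]'''

# pd.read_json accepts a URL, a path, or a file-like object;
# wrapping the string in StringIO avoids any network access here
comments_df = pd.read_json(StringIO(sample_json))
print(comments_df.shape)
print(comments_df.dtypes)
```

The `requests` route used in the cell above gives more control, for example over headers and status-code checks, while `pd.read_json` is shorter when the endpoint returns a plain array of records.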


### Consuming Free Public APIs

To load data from a _public web API_ into a DataFrame, request the _API endpoint_ with the __requests__ library, check the HTTP status code, and pass the parsed JSON to `pd.DataFrame`. Many public APIs require an API key, which the cells below send as a URL parameter.

In [7]:
# NASA picture of the day API
# https://api.nasa.gov/
from datetime import datetime as dt, timedelta
import requests
import sys
# NASA API key
api_key = 'RZd13PnO7qzf8YpH1zcmmFsZvzDaZNHfDeSyJ9i3'
# Calculate dates
end_date = (dt.now() - timedelta(days=1)).strftime("%Y-%m-%d")
start_date = (dt.now() - timedelta(days=31)).strftime("%Y-%m-%d")
# API URL with parameters
api_url = f'https://api.nasa.gov/planetary/apod?api_key={api_key}&start_date={start_date}&end_date={end_date}'
# Make the API request
response = requests.get(api_url)
if response.status_code == 200:
    nasa_data = response.json()
else:
    print('ERROR. Failed to retrieve data from NASA.')
    sys.exit()

# Load the JSON data into a DataFrame
nasa_df = pd.DataFrame(nasa_data)
# service_version holds a single unique value and is irrelevant, so it is dropped
nasa_df = nasa_df.drop(columns=['service_version'])
nasa_df['media_type'] = nasa_df['media_type'].astype('category')
nasa_df['date'] = pd.to_datetime(nasa_df['date'])
nasa_df = nasa_df[['date', 'title', 'explanation', 'url', 'hdurl', 'media_type', 'copyright']]
nasa_df.to_csv('csv-files/nasa_photos.csv', index=False)
print(nasa_df.shape)
nasa_df.head(8)


(31, 7)


Unnamed: 0,date,title,explanation,url,hdurl,media_type,copyright
0,2024-06-03,NGC 2403 in Camelopardalis,Magnificent island universe NGC 2403 stands wi...,https://apod.nasa.gov/apod/image/2405/NGC2403-...,https://apod.nasa.gov/apod/image/2405/NGC2403-...,image,(Team F.A.C.T.)
1,2024-06-04,Comet Pons-Brooks Develops Opposing Tails,Why does Comet Pons-Brooks now have tails poin...,https://apod.nasa.gov/apod/image/2406/Comet12P...,https://apod.nasa.gov/apod/image/2406/Comet12P...,image,\nRolando Ligustri &\n Lukas Demetz \n
2,2024-06-05,Shadow of a Martian Robot,What if you saw your shadow on Mars and it was...,https://apod.nasa.gov/apod/image/2406/NeretvaV...,https://apod.nasa.gov/apod/image/2406/NeretvaV...,image,
3,2024-06-06,NGC 4565: Galaxy on Edge,Magnificent spiral galaxy NGC 4565 is viewed e...,https://apod.nasa.gov/apod/image/2406/278_lora...,https://apod.nasa.gov/apod/image/2406/278_lora...,image,Lóránd Fényes
4,2024-06-07,SH2-308: The Dolphin Head Nebula,"Blown by fast winds from a hot, massive star, ...",https://apod.nasa.gov/apod/image/2406/DolphinN...,https://apod.nasa.gov/apod/image/2406/DolphinN...,image,Prabhu Kutti
5,2024-06-08,Pandora's Cluster of Galaxies,This deep field mosaicked image presents a stu...,https://apod.nasa.gov/apod/image/2406/abell274...,https://apod.nasa.gov/apod/image/2406/abell274...,image,
6,2024-06-09,How to Identify that Light in the Sky,What is that light in the sky? The answer to o...,https://apod.nasa.gov/apod/image/2406/astronom...,https://apod.nasa.gov/apod/image/2406/astronom...,image,\nHK (The League of Lost Causes)\n
7,2024-06-10,Sh2-132: The Lion Nebula,Is the Lion Nebula the real ruler of the const...,https://apod.nasa.gov/apod/image/2406/LionNeb_...,https://apod.nasa.gov/apod/image/2406/LionNeb_...,image,\nImran Badr;\nText: Natalia Lewandowska \n(SU...


In [8]:
# NY Times API
# https://developer.nytimes.com/docs/articlesearch-product/1/overview
import requests as r
import pandas as pd
import sys
# Parameters
query = 'outer earth space exploration research'
api_key = 'h4A8g5QWBqw5d9XwN1FVjq5XK7ZNmRXT'
pub_year = 2024
# Construct the API request; passing params lets requests URL-encode the query string
api_ny_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
params = {
    'q': query,
    'api-key': api_key,
    'begin_date': f'{pub_year}0101',
    'end_date': f'{pub_year}1231',
}
# Make the GET request
response = r.get(api_ny_url, params=params)
if response.status_code == 200:
    ny_data = response.json()
else:
    print('ERROR. Something failed')
    sys.exit()
# Extract the articles
ny_times_df = pd.DataFrame(ny_data['response']['docs'])
ny_times_df.to_csv('csv-files/ny_times.csv', index=False)
ny_times_df.head(10)


Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,Earth’s stratosphere has never seen the amount...,https://www.nytimes.com/2024/01/09/science/roc...,Earth’s stratosphere has never seen the amount...,The high-altitude chase started over Cape Cana...,D,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'The New Space Race Is Causing New Po...,"[{'name': 'subject', 'value': 'Space and Astro...",2024-01-09T07:59:25+0000,article,Science,Science,"{'original': 'By Shannon Hall', 'person': [{'f...",News,nyt://article/d3050ab6-aceb-5a72-aa9a-aeab49ae...,2172,nyt://article/d3050ab6-aceb-5a72-aa9a-aeab49ae...,
1,STEM can’t solve all our problems.,https://www.nytimes.com/2024/04/05/opinion/nas...,STEM can’t solve all our problems.,The window to apply to be a NASA astronaut — a...,SR,8.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'The Next Frontier? Philosophy in Spa...,"[{'name': 'subject', 'value': 'Space and Astro...",2024-04-05T09:02:10+0000,article,OpEd,Opinion,"{'original': 'By Joseph O. Chapa', 'person': [...",Op-Ed,nyt://article/151aa2d4-6761-5811-98fa-105c782f...,904,nyt://article/151aa2d4-6761-5811-98fa-105c782f...,
2,Commercial moon landings will change how we lo...,https://www.nytimes.com/2024/01/21/opinion/moo...,Commercial moon landings will change how we lo...,The moon stands alone. It is unique in the kno...,SR,10.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'What We Do to the Moon Will Transfor...,"[{'name': 'subject', 'value': 'Private Spacefl...",2024-01-21T14:00:09+0000,article,OpEd,Opinion,"{'original': 'By Rebecca Boyle', 'person': [{'...",Op-Ed,nyt://article/bdb1808b-0bba-5b20-8543-1f7d5378...,1496,nyt://article/bdb1808b-0bba-5b20-8543-1f7d5378...,
3,He helped send the twin spacecraft on their wa...,https://www.nytimes.com/2024/06/14/science/spa...,He helped send the twin spacecraft on their wa...,"Edward C. Stone, the visionary physicist who d...",,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': 'Edward Stone, 88, Physicist Who Over...","[{'name': 'subject', 'value': 'Space and Astro...",2024-06-14T23:03:17+0000,article,Obits,Science,"{'original': 'By Sam Roberts', 'person': [{'fi...",Obituary (Obit),nyt://article/0d21ac7b-59a8-533c-a00b-9c31a39d...,951,nyt://article/0d21ac7b-59a8-533c-a00b-9c31a39d...,Space & Cosmos
4,NASA is conducting tests on what might be the ...,https://www.nytimes.com/2024/02/25/magazine/ma...,NASA is conducting tests on what might be the ...,Alyssa Shannon was on her morning commute from...,MM,28.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Can Humans Endure the Psychological ...,"[{'name': 'organizations', 'value': 'National ...",2024-02-25T10:04:08+0000,article,Magazine,Magazine,"{'original': 'By Nathaniel Rich', 'person': [{...",News,nyt://article/cc4d4b64-5e15-5bb9-afe0-578f67df...,6219,nyt://article/cc4d4b64-5e15-5bb9-afe0-578f67df...,
5,"“He always was a genius,” Herbie Hancock says ...",https://www.nytimes.com/2024/07/03/arts/music/...,"“He always was a genius,” Herbie Hancock says ...","This month we feature Wayne Shorter, the icono...",,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': '5 Minutes That Will Make You Love Wa...,"[{'name': 'subject', 'value': 'Jazz', 'rank': ...",2024-07-03T09:02:14+0000,article,Culture,Arts,"{'original': 'By Marcus J. Moore', 'person': [...",News,nyt://article/eea60cd1-8bf0-5fe2-b5c0-f659fe4a...,3689,nyt://article/eea60cd1-8bf0-5fe2-b5c0-f659fe4a...,Music
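
In the output above, columns such as `headline` and `byline` hold nested dictionaries rather than plain values. One way to flatten them is `pd.json_normalize`. The sketch below uses two made-up records (`docs`, an assumption that mimics only part of the real response) so it runs without an API key.

```python
import pandas as pd

# Hypothetical records shaped like entries of ny_data['response']['docs']
docs = [
    {
        'web_url': 'https://example.com/first',
        'word_count': 2172,
        'headline': {'main': 'First headline'},
        'byline': {'original': 'By A. Writer'},
    },
    {
        'web_url': 'https://example.com/second',
        'word_count': 904,
        'headline': {'main': 'Second headline'},
        'byline': {'original': 'By B. Author'},
    },
]

# Nested keys become dotted column names such as 'headline.main'
flat_df = pd.json_normalize(docs)
print(flat_df.columns.tolist())
print(flat_df[['headline.main', 'byline.original', 'word_count']])
```

Flattened columns can be written to CSV and read back as plain values, whereas a dictionary stored in a cell is saved as its string representation.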


### Web Scraping

To load data from a website using _web scraping_ into a DataFrame, you can use the __BeautifulSoup__ library in combination with the __requests__ library. 

In [10]:
# web scraping from https://www.nytimes.com/2024/01/21/opinion/moon-commercial-companies-transform.html
# Install the en_core_web_lg model
#!python -m spacy download en_core_web_lg
import sys
import requests
import pandas as pd
import numpy as np
import spacy
import re
from bs4 import BeautifulSoup

# Define a function to count words using regex
def count_words(text:str)->int:
    return len(re.findall(r'\w+', text))

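def get_keywords(text:str, nlp_model)->str:
    # Keep nouns and proper nouns whose vector similarity to any context keyword exceeds 0.6
    document = nlp_model(text)
    context_keywords = ['space', 'moon', 'planet', 'galaxy', 'cosmos', 'rocket', 'astronomy']
    # Parse each context keyword once instead of once per token
    keyword_docs = [nlp_model(keyword) for keyword in context_keywords]
    temp_keywords = [token for token in document if token.pos_ in ('NOUN', 'PROPN')]
    keywords = [(token.text).lower() for token in temp_keywords if any(token.similarity(keyword_doc) > 0.6 for keyword_doc in keyword_docs)]
    return ', '.join(keywords)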
# ------------ Web Scraping ------------

# URL of the website to scrape
url = 'https://www.nytimes.com/2024/01/21/opinion/moon-commercial-companies-transform.html'
# Headers to mimic a browser visit
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
# Send a GET request to the website
response = requests.get(url, headers=headers)
# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    # 'pailas' is Colombian slang, roughly "no luck": the request did not return HTTP 200
    print('Error. pailas')
    sys.exit(1)
# Extract the article content
paragraphs = soup.find_all('p')
data = []
for paragraph in paragraphs:
    data.append(paragraph.get_text())
# Create a DataFrame from the extracted data
scrapping_df = pd.DataFrame(data, columns=['Content'])
scrapping_df = scrapping_df[scrapping_df['Content'].apply(count_words)>10]
scrapping_df = scrapping_df[~scrapping_df['Content'].str.contains(r'Thank you for your patience', case=False, regex=True)]
# Make simple NLP analysis
nlp = spacy.load('en_core_web_lg')
scrapping_df['keywords'] = scrapping_df['Content'].apply(lambda p: get_keywords(p, nlp))

# Print the DataFrame
scrapping_df.to_csv('csv-files/scrapping_ny_times.csv')
scrapping_df.head()

Error. pailas


SystemExit: 1

# Data Pre-Processing

__Pre-processing__ data is an essential step in the _data science workflow_. It involves _transforming raw data_ into a clean and structured format that is suitable for _analysis and modeling_. __Pre-processing__ typically includes several steps: _data cleaning_, _data integration_, _data transformation_, and _data reduction_.

_Data cleaning_ involves handling missing values, outliers, and inconsistencies in the data. _Missing values_ can be imputed or removed depending on the nature of the data and the analysis requirements. _Outliers_, which are extreme values that deviate markedly from the rest of the data, can be detected and treated accordingly. Inconsistencies in the data, such as conflicting values or duplicate records, need to be resolved to ensure data integrity.
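
As a minimal sketch of these two ideas, the example below imputes a missing value with the median and flags outliers with the interquartile-range (IQR) rule. The series `values` is made-up data, and the 1.5 × IQR cutoff is a common convention rather than something prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one gap and one extreme value
values = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 95.0], name='minutes')

# Impute the missing entry with the median, which is robust to the extreme value
imputed = values.fillna(values.median())

# Flag outliers with the IQR rule: anything beyond 1.5 * IQR from the quartiles
q1, q3 = imputed.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (imputed < q1 - 1.5 * iqr) | (imputed > q3 + 1.5 * iqr)

print(imputed[is_outlier])
cleaned = imputed[~is_outlier]
print(cleaned.describe())
```

Whether a flagged value should be removed, capped, or kept depends on the analysis; the mask only identifies candidates.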

_Data integration_ involves combining data from multiple sources into a unified dataset. This step may require resolving differences in data formats, units, or naming conventions. It is important to ensure that the integrated data is consistent and accurate.
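
A small sketch of integration with pandas follows. The two tables (`registry` and `dealers`) and their column names are hypothetical; the point is to harmonize names and units before joining, and to use `indicator=True` to see which source each row came from.

```python
import pandas as pd

# Two hypothetical sources describing overlapping sets of vehicles
registry = pd.DataFrame({
    'Vehicle_ID': [1, 2, 3],
    'Range_Miles': [270, 84, 208],
})
dealers = pd.DataFrame({
    'vehicle id': [1, 2, 4],
    'price_usd': [69900, 29010, 41000],
})

# Harmonize naming conventions before joining
registry = registry.rename(columns={'Vehicle_ID': 'vehicle_id', 'Range_Miles': 'range_miles'})
dealers = dealers.rename(columns={'vehicle id': 'vehicle_id'})

# Harmonize units: convert miles to kilometers
registry['range_km'] = (registry['range_miles'] * 1.609344).round(1)

# An outer join keeps unmatched rows; the '_merge' column records their origin
merged = registry.merge(dealers, on='vehicle_id', how='outer', indicator=True)
print(merged[['vehicle_id', 'range_km', 'price_usd', '_merge']])
```

Rows marked `left_only` or `right_only` are the ones to review for consistency before analysis.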

_Data transformation_ involves converting data into a suitable format for analysis. This may include _scaling numerical variables_, _encoding categorical variables_, or creating new derived features. Scaling ensures that variables are on a similar scale, which is important for certain algorithms. Encoding categorical variables converts them into numerical representations that can be processed by machine learning algorithms. Creating derived features involves extracting meaningful information from existing variables or combining multiple variables to capture complex relationships.
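
The three transformations named above can be sketched with plain pandas. The frame `df_demo` is made-up data that borrows two column names from the call-center dataset used later, and the 60-minute cutoff for the derived feature is an arbitrary assumption.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'attention_time': [5.0, 50.0, 95.0, 140.0],
    'product': ['Tablet', 'Drone', 'Tablet', 'Smart Watch'],
})

t = df_demo['attention_time']
# Min-max scaling maps the values onto the range [0, 1]
df_demo['attention_time_minmax'] = (t - t.min()) / (t.max() - t.min())
# Standardization (z-score) gives mean 0 and unit variance (population std)
df_demo['attention_time_z'] = (t - t.mean()) / t.std(ddof=0)

# One-hot encoding creates one indicator column per category
encoded = pd.get_dummies(df_demo, columns=['product'], prefix='product')

# Derived feature: flag long calls (the 60-minute cutoff is illustrative)
encoded['is_long_call'] = encoded['attention_time'] > 60
print(encoded)
```

Min-max scaling suits algorithms that expect bounded inputs, while standardization is the usual choice when features are compared by distance or variance.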

_Data reduction techniques_ are used to reduce the dimensionality of the dataset while _preserving important information_. This is particularly useful when dealing with high-dimensional data or when computational resources are limited. Techniques such as _feature selection_ and _feature extraction_ can be applied to identify the most relevant variables or to create new variables that capture the essence of the data.
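
As one illustration of _feature extraction_, the sketch below runs principal component analysis (PCA) using NumPy's singular value decomposition. The data is synthetic (five features generated from two hidden factors), so the numbers only illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# PCA: center the data, then take the SVD of the centered matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance explained by each principal component
explained_ratio = S**2 / np.sum(S**2)

# Project onto the first two components
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T
print(X_reduced.shape)
print(explained_ratio.round(3))
```

Because the five features were built from two factors, the first two components capture almost all of the variance; on real data, the explained-variance ratios help decide how many components to keep.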

Overall, __pre-processing data__ is a critical step in data science as it ensures the quality and usability of the data for analysis and modeling tasks. By carefully handling missing values, outliers, inconsistencies, and transforming the data appropriately, _data scientists can obtain reliable insights_ and build accurate predictive models.

### Data Profiling

__Data profiling__ is an essential process in _data science_ that involves _analyzing_ and _understanding_ the characteristics of a dataset. It provides _valuable insights_ into the quality, structure, and content of the data, enabling data scientists to make informed decisions during the _data analysis_ and modeling stages.

During __data profiling__, various _statistical measures_ and techniques are applied to gain a comprehensive _understanding_ of the dataset. This includes examining the _data types_, _identifying missing values_, detecting outliers, assessing data distributions, and exploring _relationships between variables_. By performing these analyses, data scientists can uncover patterns, trends, and anomalies within the data.

In [14]:
import pandas as pd
import numpy as np

df = pd.read_csv('csv-files/call_center_comments.csv')
print(df.head(8))

# 1. Change data types
# NOTE: 'attention_category', 'region' and 'country' are not among the columns that
# df.info() lists later in this notebook; adjust these names to match your dataset
df['attention_category'] = df['attention_category'].astype('category')
df['region'] = df['region'].astype('category')
df['country'] = df['country'].astype('category')
df['product'] = df['product'].astype('category')
df['date_time'] = pd.to_datetime(df['date_time'])
df['attention_time'] = df['attention_time'].round(2)

# 1.1 Change categorical to numerical
df['attention_category_num'] = df['attention_category'].cat.codes
df['region_num'] = df['region'].cat.codes
df['country_num'] = df['country'].cat.codes

print('='*40, 'Basic Information:')
print(f'Shape: {df.shape}')
print(f'Columns: {df.columns.tolist()}')
print(f'DataTypes:\n{df.dtypes}\n')

# 2. Descriptive Statistics
print('='*40, 'Descriptive Statistics:')
print(df.describe(), '\n')

# 3. Missing Values
print('='*40, 'Missing Values:')
print(df.isnull().sum())

# 4. Value counts for the categorical columns
print('='*40, 'Value Counts for "attention_category":')
print(df['attention_category'].value_counts())
print('='*40, 'Value Counts for "region":')
print(df['region'].value_counts())

# 5. Correlation
print('='*40, 'Correlation:')
print(df.select_dtypes(include=['number']).corr())


          code           client              product            date_time  \
0  JLz-1254574  Terri Valentine      USB Flash Drive  2021-12-27 21:56:53   
1  YsR-3166466   Jessica Powell               Tablet  2024-06-27 10:11:33   
2  QeH-0295056     Dana Hensley   Portable Projector  2022-02-06 21:59:15   
3  aUd-2224033         Amy Kent  External Hard Drive  2020-10-18 05:14:15   
4  nvp-6413002     Andrea Jones   Portable Projector  2022-10-01 03:44:58   
5  tey-2479448  Anthony Shaffer  External Hard Drive  2023-07-23 12:18:16   
6  HHA-4943747       Erin Smith                Drone  2020-10-11 02:56:22   
7  CkW-2811762     Johnny White          Smart Watch  2020-06-15 06:54:37   

   attention_time                                  comment country_of_origin  \
0       41.675933            Write speeds could be faster.            France   
1       72.571054  Great for reading and streaming videos.             India   
2        8.573219            Battery life could be longer.        


### Data Cleaning

__Data cleaning__ is a crucial step in the _data science process_. It involves identifying and _correcting errors_, _inconsistencies_, and _inaccuracies_ in the dataset to ensure its _quality_ and reliability. The process typically includes _handling missing values_, _removing duplicates_, _dealing with outliers_, and resolving inconsistencies in _data formats_ or units.

_Handling missing values_ is an important aspect of _data cleaning_. _Missing values_ can occur for various reasons, such as _data collection errors_ or incomplete records. Strategies for handling them include _imputation_, where missing values are replaced with estimates based on statistical techniques, and deletion, where rows or columns with missing values are removed from the dataset.

_Removing duplicates_ is another key task in __data cleaning__. Duplicates can arise from data entry errors or data merging processes. Identifying and _removing duplicate records_ ensures that each observation in the dataset is unique and avoids bias in subsequent analyses.
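
A short sketch of duplicate handling with pandas follows. The frame `records` is made-up data, and treating `code` as the unique key is an assumption for illustration.

```python
import pandas as pd

records = pd.DataFrame({
    'code': ['A-1', 'A-1', 'B-2', 'B-2'],
    'client': ['Amy Kent', 'Amy Kent', 'Erin Smith', 'Erin Smith'],
    'attention_time': [41.7, 41.7, 72.6, 80.1],
})

# Exact duplicates: rows where every column repeats an earlier row
print(records.duplicated().sum())
exact = records.drop_duplicates()

# Key-based duplicates: keep only the last record for each code
by_key = records.drop_duplicates(subset='code', keep='last')
print(by_key)
```

Exact matching is the safer default; key-based removal discards information and should reflect a deliberate rule about which record wins.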

In [17]:
import numpy as np

## Data Cleaning
df['city'] = df['city'].fillna('Unknown').astype('category')
df['attention_time'] = df['attention_time'].fillna(np.round(np.mean(df['attention_time']), 2))

# Drop duplicates and any rows that still contain nulls
df = df.drop_duplicates()
df = df.dropna()

# Reset index
print('='*40, 'Data Cleaning without null or duplicate values:\n')
df = df.reset_index(drop=True)
print(df.info())

# Removing outliers using mean and standard deviation
print('='*40, 'Removing outliers using z-score:')
z_threshold = 2.5
df_cleaned = df.copy()
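for column in df.select_dtypes(include='float64').columns:
    print(column, end='--- ')
    # z-scores are computed on the full data so every column uses the same baseline
    column_zscore = (df[column] - df[column].mean()) / df[column].std()
    keep = np.abs(column_zscore) <= z_threshold
    # Align the mask with the rows that survived earlier iterations
    df_cleaned = df_cleaned[keep.loc[df_cleaned.index]]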

print('Before:', df['attention_time'].describe())
print('After:', df_cleaned['attention_time'].describe())

df = df_cleaned.copy()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980227 entries, 0 to 980226
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   code               980227 non-null  object  
 1   client             980227 non-null  object  
 2   product            980227 non-null  object  
 3   date_time          980227 non-null  object  
 4   attention_time     980227 non-null  float64 
 5   comment            980227 non-null  object  
 6   country_of_origin  980227 non-null  object  
 7   city               980227 non-null  category
dtypes: category(1), float64(1), object(6)
memory usage: 53.3+ MB
None
attention_time--- Before: count    980227.000000
mean         77.462081
std          40.778261
min           5.000052
25%          43.105321
50%          77.460000
75%         111.782405
max         149.999933
Name: attention_time, dtype: float64
After: count    980227.000000
mean         77.462081
std          40.77
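
The z-score filter from the cell above can also be written as a reusable function that builds one combined mask over all numeric columns. This is a sketch, not the notebook's original code: `remove_zscore_outliers` and the `demo` frame are hypothetical names introduced here.

```python
import pandas as pd


def remove_zscore_outliers(frame: pd.DataFrame, threshold: float = 2.5) -> pd.DataFrame:
    """Drop rows where any numeric column lies more than `threshold` std devs from its mean."""
    numeric = frame.select_dtypes(include='number')
    zscores = (numeric - numeric.mean()) / numeric.std()
    # NaN z-scores (constant columns or missing values) compare as False,
    # so they are never treated as outliers
    keep = ~(zscores.abs() > threshold).any(axis=1)
    return frame[keep]


# Made-up data: ten ordinary calls and one extreme attention time
demo = pd.DataFrame({
    'attention_time': [40.0] * 10 + [500.0],
    'calls': list(range(11)),
})
filtered = remove_zscore_outliers(demo)
print(len(demo), '->', len(filtered))
```

In the recorded run above, the row count is the same before and after filtering, which suggests that no `attention_time` value exceeded the 2.5 threshold.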

# Exploratory Data Analysis

__Exploratory Data Analysis__ (EDA) is a crucial process in data science that involves _examining and understanding_ the characteristics of a _dataset_. It serves as a foundation for further analysis and modeling tasks. 

During __EDA__, data scientists employ various techniques to _gain insights_ into the data. This includes summarizing the main features and statistics of the dataset, visualizing the data through plots and charts, and identifying patterns and relationships between variables. 

Overall, __EDA__ plays a vital role in understanding the data, _formulating hypotheses_, and _generating insights_ that drive the subsequent steps in the _data science workflow_. It helps in making informed decisions, validating assumptions, and building robust models that can effectively solve real-world problems.
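
A first EDA pass often combines a grouped summary with a correlation check. The sketch below uses a tiny made-up frame (`calls`); the `comment_length` column is hypothetical and does not exist in the dataset used in this notebook.

```python
import pandas as pd

calls = pd.DataFrame({
    'product': ['Tablet', 'Drone', 'Tablet', 'Drone', 'Tablet', 'Drone'],
    'attention_time': [30.0, 90.0, 50.0, 110.0, 40.0, 100.0],
    'comment_length': [20, 55, 31, 70, 24, 62],
})

# Summarize a numeric variable within each category
summary = calls.groupby('product')['attention_time'].agg(['count', 'mean', 'median'])
print(summary)

# Pairwise Pearson correlation between the numeric variables
correlation = calls[['attention_time', 'comment_length']].corr()
print(correlation)
```

A large gap between group means or a strong correlation is a prompt for a hypothesis to test, not a conclusion in itself.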

### Pandas Exploration

In [31]:
# column names
print("Columns:", df.columns.tolist())

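target = 'attention_category_num'
# The target must already exist as a numeric column (it is created in step 1.1 of the
# profiling cell); if it is missing, .corr()[target] raises a KeyError
df.select_dtypes(include=['number']).corr()[target].sort_values().drop(target).plot(kind='bar')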

Columns: ['code', 'client', 'product', 'date_time', 'attention_time', 'comment', 'country_of_origin', 'city']


KeyError: 'attention_category_num'

In [19]:
#!pip install ydata_profiling
from ydata_profiling import ProfileReport
profile_obj = ProfileReport(df, title='Call Center Comments Profile')
profile_obj.to_file('html-files/call_center_comments_profile.html')
profile_obj

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]



### Extract-Transform-Load Pipelines



In [21]:
# Extract-Transform-Load Pipelines
# Step 1: Import Required Libraries
import pandas as pd
# Step 2: Define Functions for DataFrame Manipulation
def merge_dataframes(df):
    # The incoming df is ignored: this first stage only loads and joins the two sources
    students_csv_df = pd.read_csv('csv-files/StudentsInfo.csv')
    students_csv_df.columns = students_csv_df.columns.str.lower()
    students_json_df = pd.read_json('json-files/StudentsInfo.json')
    students_json_df.columns = students_json_df.columns.str.lower()
    result_df = pd.merge(students_csv_df, students_json_df, on='name', how='inner')
    return result_df

def clean_df(df):
    df = df.fillna('Lost Values')
    df = df.drop_duplicates()
    return df

def pretty_presentation(df:pd.DataFrame) -> pd.DataFrame:
    """This function formats the dataframe for pretty presentation
    Args:
        df (DataFrame): Input reference dataframe
    Returns:
        A formatted dataframe"""
    df.columns = df.columns.str.title()
    df = df.set_index('Name')
    columns = ['Position', 'Career', 'Company', 'College', 'Salary']
    df = df[columns]
    return df

def save_students_info(df: pd.DataFrame) -> pd.DataFrame:
    """This function saves the students' information in a CSV file
    Args:
        df (DataFrame): Input reference dataframe
    Returns:
        A dataframe with the students' information"""
    df.to_csv('csv-files/students_final.csv')
    return df

# Step 3: Create the Pipeline
final_df = (
    pd.DataFrame()
    .pipe(merge_dataframes)
    .pipe(clean_df)
    .pipe(pretty_presentation)
    .pipe(save_students_info)
)
final_df.head()

Unnamed: 0_level_0,Position,Career,Company,College,Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alice Johnson,Petroleum engineer,Computer Science,"Hernandez, Griffith and Nelson",Tech University,4740
David Jones,"Geologist, engineering",Biology,Gomez-Garcia,Science College,73329
Eva Brown,Microbiologist,Physics,Blevins LLC,Tech University,83245
Frank Davis,Museum education officer,Chemistry,Greene-Wilson,Science College,74390
Jack Anderson,"Scientist, research (maths)",Software Engineering,Butler PLC,Tech University,69851
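
The same `.pipe` pattern can be tried without any input files. In the sketch below every function and value is hypothetical: `extract` stands in for reading a CSV or an API, and each later stage takes a DataFrame and returns a new one, which is what makes the stages chainable and easy to test in isolation.

```python
import pandas as pd


def extract() -> pd.DataFrame:
    """Stand-in for reading a file or an API: returns made-up raw rows."""
    return pd.DataFrame({
        'NAME': ['Alice Johnson', 'David Jones', 'David Jones', 'Eva Brown'],
        'SALARY': [4740, 73329, 73329, None],
    })


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: lower-case the column names
    return df.rename(columns=str.lower)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: remove exact duplicates and rows without a salary
    return df.drop_duplicates().dropna(subset=['salary'])


def add_salary_band(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    # Transform: extra arguments to .pipe are forwarded to the function
    return df.assign(high_salary=df['salary'] > threshold)


result = (
    extract()
    .pipe(normalize_columns)
    .pipe(clean)
    .pipe(add_salary_band, threshold=50000)
)
print(result)
```

Starting the chain from the extract step, rather than from an empty `pd.DataFrame()`, keeps each stage's input explicit; a load step such as `to_csv` can be appended as a final stage.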
