EwelinaSwiderska/DataAnalysisWithPython

Data Analysis with Python

Description


I was given three data files (Excel, CSV, JSON) from Camden Council about trees and their environment in the local area, to determine whether the Council can deliver three new initiatives:
* A list of all trees in the borough
* A series of “Tree Walks” brochures with information about interesting trees and park locations in the area
* An Environment Report on the total carbon and pollution benefit provided by all the Council's trees, covering trees removed, trees planted, and the net carbon and pollution impact of this activity


Additional information:


* The Trees data set is for public use and owned by the Council (downloaded from the website, Excel file)
* The Common Names data set is for public use and owned by a horticultural website (scraped from the website, JSON file)
* The Environment data set is for internal use and owned by the Council (extracted from a database, CSV file)

Analysis walk-through:


⇨ Loading the three data files (Excel, CSV, JSON) into the notebook and running simple checks (head, shape, columns, dtypes) on each data set with the pandas library to get a first understanding of the data
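
As a rough illustration of this step, the sketch below reads a CSV and a JSON source and runs the same basic checks. The tiny in-memory samples and every column name are assumptions made up for the example, not the Council's real schema.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the real files; all column names are hypothetical.
csv_text = "Identifier,Carbon Storage (kg)\nT1,120.5\nT2,\n"
json_text = '[{"Scientific Name": "Acer campestre", "Common Name": "Field maple"}]'

environment = pd.read_csv(io.StringIO(csv_text))
common_names = pd.read_json(io.StringIO(json_text))
# The Excel file would be read in the same way with pd.read_excel(<path>),
# which needs an Excel engine such as openpyxl to be installed.


def quick_look(name, df):
    """Print the basic checks used in this step for one data set."""
    print(f"--- {name} ---")
    print(df.head())
    print("shape:", df.shape)
    print("columns:", list(df.columns))
    print(df.dtypes)


quick_look("environment", environment)
quick_look("common_names", common_names)
```

Wrapping the checks in one helper keeps the output comparable across the three data sets.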





⇨ Performing further data checks (value_counts, describe, unique) to understand what kind of data we are dealing with: qualitative (nominal, ordinal, binary) or quantitative (discrete, continuous). Taking a closer look at the float columns for missing values. Classifying the data type of every column in every data set
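
A minimal sketch of these checks on a made-up sample follows. The column names and values are hypothetical, and the comments show one possible way to classify each column.

```python
import pandas as pd

# Hypothetical sample; the real Trees data set has different columns and values.
trees = pd.DataFrame({
    "Maturity": ["Young", "Mature", "Mature", "Juvenile"],  # qualitative, ordinal
    "Species": ["Oak", "Lime", "Oak", "Plane"],             # qualitative, nominal
    "Number Of Trees": [1, 1, 3, 1],                        # quantitative, discrete
    "Height In Metres": [4.5, 18.0, None, 2.0],             # quantitative, continuous
})

print(trees["Species"].value_counts())  # frequency of each category
print(trees["Maturity"].unique())       # distinct values in a column
print(trees.describe())                 # summary statistics for numeric columns

# Closer look at float columns: count the missing values in each one.
float_cols = trees.select_dtypes(include="float").columns
missing_in_floats = trees[float_cols].isna().sum()
print(missing_in_floats)
```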





⇨ Checking all data sets for nulls and zero values to find out how big the missing-values problem is
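
One way to summarise nulls and zeros per column is sketched below. The data and column names are invented for illustration, and treating a zero as "missing" is an assumption that only makes sense for measurements such as height or spread.

```python
import pandas as pd

# Hypothetical measurements; None becomes NaN in a float column.
df = pd.DataFrame({
    "Height In Metres": [4.5, 0.0, None, 12.0],
    "Spread In Metres": [2.0, None, None, 0.0],
})

summary = pd.DataFrame({
    "nulls": df.isna().sum(),   # genuinely missing values
    "zeros": (df == 0).sum(),   # zeros that may stand in for "not recorded"
})
# Share of rows per column that are null or zero, as a percentage.
summary["pct_missing"] = (summary["nulls"] + summary["zeros"]) / len(df) * 100
print(summary)
```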



⇨ Identifying extreme outliers in all data sets using box plots to check for potential errors
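
A box plot draws its whiskers at 1.5 times the interquartile range by default, so the same points can also be listed numerically. The sketch below applies that rule to a made-up series of heights; it is an illustration of the idea, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical tree heights in metres; 120 is an implausible value.
heights = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0, 120.0], name="Height In Metres")

q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Standard box plot fences: anything beyond them is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
print(outliers)

# With matplotlib installed, heights.plot.box() would draw the plot itself.
```

Listing the flagged values makes it easier to decide which ones are real extremes and which are data-entry errors.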





⇨ Identifying duplicates
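
Duplicates can mean fully repeated rows or a repeated key with different details. A short sketch of both checks, assuming a hypothetical `Identifier` column as the tree key:

```python
import pandas as pd

# Hypothetical sample: T2 is repeated exactly, T3 is repeated with a different species.
trees = pd.DataFrame({
    "Identifier": ["T1", "T2", "T2", "T3", "T3"],
    "Species": ["Oak", "Lime", "Lime", "Plane", "Ash"],
})

# Rows that are identical in every column (keep=False marks all copies).
full_duplicates = trees[trees.duplicated(keep=False)]

# Identifiers that appear more than once, even if other columns differ.
repeated_ids = trees.loc[trees["Identifier"].duplicated(), "Identifier"].unique()

print(full_duplicates)
print(repeated_ids)
```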





⇨ Identifying geolocation issues to find data for trees outside our area
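
A simple way to flag such trees is a bounding-box test on the coordinates. In the sketch below the column names are hypothetical and the latitude and longitude limits are rough illustrative values, not an official borough boundary.

```python
import pandas as pd

# Hypothetical sample with one valid point, a (0, 0) placeholder,
# a point too far south, and a row with no coordinates.
trees = pd.DataFrame({
    "Identifier": ["T1", "T2", "T3", "T4"],
    "Latitude": [51.54, 0.0, 51.45, None],
    "Longitude": [-0.14, 0.0, -0.14, None],
})

# Rough illustrative limits only (assumption, not a surveyed boundary).
LAT_MIN, LAT_MAX = 51.51, 51.58
LON_MIN, LON_MAX = -0.22, -0.10

inside = (
    trees["Latitude"].between(LAT_MIN, LAT_MAX)
    & trees["Longitude"].between(LON_MIN, LON_MAX)
)

# Rows with missing coordinates fail the test too, so they are flagged as well.
outside_area = trees[~inside]
print(outside_area)
```

A bounding box is coarse: it can accept points just over the border, so a polygon test against the real boundary would be stricter.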





⇨ Identifying unmatched data to find trees without matching environmental data
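
One common pandas approach is a left join with `indicator=True`, which records where each row came from. The sketch below assumes a shared, hypothetical `Identifier` key between the two data sets.

```python
import pandas as pd

# Hypothetical samples: T2 has no environmental record, T9 has no tree record.
trees = pd.DataFrame({
    "Identifier": ["T1", "T2", "T3"],
    "Species": ["Oak", "Lime", "Plane"],
})
environment = pd.DataFrame({
    "Identifier": ["T1", "T3", "T9"],
    "Carbon Storage (kg)": [120.5, 80.0, 15.2],
})

# The _merge column says whether a row matched ("both") or not ("left_only").
merged = trees.merge(environment, on="Identifier", how="left", indicator=True)
unmatched_trees = merged[merged["_merge"] == "left_only"]
print(unmatched_trees)
```

Switching to `how="outer"` would also surface environmental records that have no matching tree.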



Conclusions:



✔️ Problems with data quality (many missing values, outliers that can be classified as errors, some duplicated data, unmatched data between data sets)

✔️ Problems with data sensitivity (data scraped from a website without permission, data from Council databases intended for internal use)

✔️ Not enough information to complete the Council initiatives (no data on park locations, interesting trees, or planted trees)


Additional knowledge gained:


✔️ Importance of Data Quality

✔️ Importance of Data Sensitivity and Ownership
