<a href="https://colab.research.google.com/github/Mjboothaus/getting-started-python/blob/main/GettingStartedPythonAnalytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DataBooth - Some resources for getting started with Python and Analytics

This short document provides a few recommendations for getting started with the Python language and data analytics.

There is **no unique path** to follow - we all learn differently - although I think most of us learn best by doing, i.e. by experimenting with code and solving small problems as we go along. Few of us "get the textbook" and work systematically through it!

Unless otherwise indicated, these resources are available for free (or the courses can be taken for free in "audit" mode).

There are many other quality free and paid resources available online; however, these are very good starting places. **Happy problem solving and coding - Enjoy!**


In no particular order...

## Google Colaboratory

That's the platform that this document is hosted on.
Google has plenty of resources to help you learn about using their complimentary, computational notebooks (their version of [Jupyter](https://jupyter.org) notebooks).

See their excellent introductory overview notebook [here](https://colab.research.google.com/notebooks/intro.ipynb).

## IBM courses on Coursera

[Python for Data Science, AI & Development](https://www.coursera.org/learn/python-for-applied-data-science-ai): this course can be taken as part of a number of specializations. The following two look most relevant:
- [Data Science Fundamentals with Python and SQL Specialization](https://www.coursera.org/specializations/data-science-fundamentals-python-sql)
- [Data Engineering Foundations Specialization](https://www.coursera.org/specializations/data-engineering-foundations)

If I'm being completely frank, it seems to me that _Data Engineering_ skills are likely to be in greater demand than _Data Science_ skills moving forward - although there can be significant overlap between the two areas.

## Replit.com

Replit provides a free, on-demand, collaborative, in-browser IDE to code in over 50 languages. Consequently, it does not require any explicit setup to get going, which makes it a great platform for newbies, with a gamified look and feel :)

Create an account at [Replit.com](https://replit.com) and choose a Python project to get going.

There are some Python-specific tutorials embedded within the learning materials on the platform.

## Two (low cost) paid options - Jose Portilla Udemy Courses

- [2022 Complete Python Bootcamp From Zero to Hero in Python](https://www.udemy.com/course/complete-python-bootcamp/)
- [Python for Data Science and Machine Learning Bootcamp](https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/)

I learnt Python from the first of these two courses. Typically you can purchase them on Udemy at a significant discount, for around AUD $20 each.

## Articles

### [Towards Data Science](https://towardsdatascience.com/about)

- [9 Free Quality Resources to Learn and Expand Your Python Skills - Learn Python regardless of your technical background](https://towardsdatascience.com/9-free-quality-resources-to-learn-and-expand-your-python-skills-44e0fe920cf4)

### [Medium](https://medium.com)

- TODO: Add more references.

## Tinkerstellar and Juno (options for iOS devices)

If you own an iPad, there are some (fairly) new and interesting learning resources. From my testing they feel very solid and elegant, a bit like a glossy magazine built on [Jupyter](https://jupyter.org/) notebooks.

Tinkerstellar is an iOS app that helps you learn coding and data science with interactive tutorials (or labs), where you can edit and run code examples straight away, with no need to configure environments, download datasets or rely on a network connection to execute code. Just download a lab, and you’re ready to play around and experiment with code solving real-world problems of computational science 🚀. Once you move on from learning in Tinkerstellar, there is Juno for more serious work.



### Author

Dr. Michael J. Booth

Founder, DataBooth

michael@databooth.com.au

_P.S. If you spot any errors or typos, please email me._

---

## Simple example - Loading / Preview of data 

Some Python code to load some sample data from [NSW Open Data](https://www.data.nsw.gov.au).

In [1]:
# %pip install -U pandas-profiling  # uncomment to install (note: the package has since been renamed ydata-profiling)

In [2]:
import pandas as pd  # Pandas is the usual library for doing structured data analysis
import pandas_profiling as pp

Example data set

[COVID-19 cases by notification date and postcode, local health district, and local government area.csv (Discontinued)](https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-aefcde60-3b0c-4bc0-9af1-6fe652944ec2/distribution/dist-nsw-ckan-21304414-1ff1-4243-a5d2-f52778048b29/details?q=)

In [3]:
data_URL = "https://data.nsw.gov.au/data/dataset/aefcde60-3b0c-4bc0-9af1-6fe652944ec2/resource/21304414-1ff1-4243-a5d2-f52778048b29/download/confirmed_cases_table1_location.csv"

In [4]:
# Look at first 5 lines of .csv data file
# Example of using Unix command line from within notebook (note the use of '!' before the actual command)

!curl -s $data_URL | head -n 5

notification_date,postcode,lhd_2010_code,lhd_2010_name,lga_code19,lga_name19
2020-01-25,2134,X700,Sydney,11300,Burwood (A)
2020-01-25,2121,X760,Northern Sydney,16260,Parramatta (C)
2020-01-25,2071,X760,Northern Sydney,14500,Ku-ring-gai (A)
2020-01-27,2033,X720,South Eastern Sydney,16550,Randwick (C)
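
The same peek can be done without the shell, using pandas alone. A minimal sketch, assuming the rows printed above are held in an in-memory string (`sample_csv` is a hypothetical stand-in for the remote file, so the example runs offline). With the real file, passing `data_URL` and `nrows` to `pd.read_csv` should have the same effect.

```python
import io

import pandas as pd

# Hypothetical stand-in for the remote file: the rows printed above
sample_csv = """notification_date,postcode,lhd_2010_code,lhd_2010_name,lga_code19,lga_name19
2020-01-25,2134,X700,Sydney,11300,Burwood (A)
2020-01-25,2121,X760,Northern Sydney,16260,Parramatta (C)
2020-01-25,2071,X760,Northern Sydney,14500,Ku-ring-gai (A)
2020-01-27,2033,X720,South Eastern Sydney,16550,Randwick (C)
"""

# nrows limits how many data rows pandas parses (the header row is not counted)
preview_df = pd.read_csv(io.StringIO(sample_csv), nrows=3)
print(preview_df)
```

Reading only a few rows this way is handy for checking column names before committing to loading a large file.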


In [5]:
# Load data into a pandas dataframe

data_df = pd.read_csv(data_URL)

In [6]:
data_df.columns.to_list()

['notification_date',
 'postcode',
 'lhd_2010_code',
 'lhd_2010_name',
 'lga_code19',
 'lga_name19']

In [7]:
data_df.head()

| | notification_date | postcode | lhd_2010_code | lhd_2010_name | lga_code19 | lga_name19 |
|---|---|---|---|---|---|---|
| 0 | 2020-01-25 | 2134 | X700 | Sydney | 11300 | Burwood (A) |
| 1 | 2020-01-25 | 2121 | X760 | Northern Sydney | 16260 | Parramatta (C) |
| 2 | 2020-01-25 | 2071 | X760 | Northern Sydney | 14500 | Ku-ring-gai (A) |
| 3 | 2020-01-27 | 2033 | X720 | South Eastern Sydney | 16550 | Randwick (C) |
| 4 | 2020-03-01 | 2077 | X760 | Northern Sydney | 14000 | Hornsby (A) |


In [8]:
data_df['postcode'].nunique()

765
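
`nunique()` only reports how many distinct values exist. To see which values occur most often, `value_counts()` is a natural companion. A minimal sketch on a small made-up series (the `postcodes` values below are illustrative only, not a sample of the full data set):

```python
import pandas as pd

# Hypothetical postcodes for illustration; "2134" is repeated on purpose
postcodes = pd.Series(["2134", "2121", "2071", "2033", "2134"], name="postcode")

# Number of distinct postcodes
n_distinct = postcodes.nunique()

# Frequency of each postcode, sorted from most to least common
postcode_counts = postcodes.value_counts()

print(n_distinct)
print(postcode_counts.head())
```

On the real data the equivalent call would be `data_df['postcode'].value_counts()`.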

In [9]:
data_df.describe()

| | notification_date | postcode | lhd_2010_code | lhd_2010_name | lga_code19 | lga_name19 |
|---|---|---|---|---|---|---|
| count | 973412 | 973412 | 959365 | 959365 | 959279 | 959279 |
| unique | 686 | 765 | 17 | 17 | 130 | 130 |
| top | 2022-01-06 | 2170 | X710 | South Western Sydney | 11570 | Canterbury-Bankstown (A) |
| freq | 41338 | 19158 | 167966 | 167966 | 68654 | 68654 |
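
In the summary above, the `count` values are lower for the health district and LGA columns than for `postcode`, which suggests that some rows have missing values in those columns. A sketch of how one might count them, using a small hypothetical frame (`sample_df` and its contents are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing health district name
sample_df = pd.DataFrame({
    "postcode": ["2134", "2121", "2071"],
    "lhd_2010_name": ["Sydney", None, "Northern Sydney"],
})

# isna() flags missing cells; summing the flags counts them per column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

Applied to the real data, `data_df.isna().sum()` should give the per-column gap between the total row count and the `count` row of `describe()`.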


In [10]:
data_df.tail(5)

| | notification_date | postcode | lhd_2010_code | lhd_2010_name | lga_code19 | lga_name19 |
|---|---|---|---|---|---|---|
| 973407 | 2022-02-07 | 2283 | X800 | Hunter New England | 14650 | Lake Macquarie (C) |
| 973408 | 2022-02-07 | 2019 | X720 | South Eastern Sydney | 10500 | Bayside (A) |
| 973409 | 2022-02-07 | 2076 | X760 | Northern Sydney | 14500 | Ku-ring-gai (A) |
| 973410 | 2022-02-07 | 2760 | X750 | Nepean Blue Mountains | 16350 | Penrith (C) |
| 973411 | 2022-02-07 | 2066 | X760 | Northern Sydney | 14700 | Lane Cove (A) |


In [11]:
data_df.dtypes

notification_date    object
postcode             object
lhd_2010_code        object
lhd_2010_name        object
lga_code19           object
lga_name19           object
dtype: object
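
Every column is reported as `object`, which in practice means pandas is holding the values as text. Converting `notification_date` to a proper datetime type makes date-based analysis possible. A minimal sketch, assuming a small hypothetical frame that mirrors the first rows of the real data (`sample_df` is made up for illustration):

```python
import pandas as pd

# Hypothetical frame mirroring the first few rows shown earlier
sample_df = pd.DataFrame({
    "notification_date": ["2020-01-25", "2020-01-25", "2020-01-25", "2020-01-27"],
    "postcode": ["2134", "2121", "2071", "2033"],
})

# Convert the text dates to datetime values
sample_df["notification_date"] = pd.to_datetime(sample_df["notification_date"])

# Count the number of cases notified on each day
cases_per_day = sample_df.groupby("notification_date").size()
print(cases_per_day)
```

Keeping `postcode` as text is a deliberate choice here: postcodes are labels rather than quantities, so arithmetic on them is not meaningful.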

## Pandas Data Profiling Report

In [16]:
pp.ProfileReport(data_df, title="Demo: NSW COVID data", correlations=None).to_notebook_iframe()
