### Introduction

What do we mean by Machine Learning?
- Learning from data and finding patterns without being explicitly programmed.

Three broad categories:
- Supervised learning
- Unsupervised learning
- Reinforcement learning - learning by maximizing a reward function, for example training a self-driving car using feedback from the environment.

In this workshop, we will focus on two supervised learning tasks:
- Classification 
- Regression

How to approach a problem? Broadly two parts:
- Data exploration and feature engineering
- Model building, tuning and testing  

### Today's session
Topics to be covered today:  
- Pandas dataframes as the data structure for datasets
- Converting csv files to dataframes 
- Slicing and indexing dataframes using conditionals as well as the iloc and loc indexers.
- Statistical summary and exploration of dataframes
- Detecting and filling missing values in the dataframes 
- Regular expressions for data extraction
- Feature engineering such as creating new features 
- Basic plots
- Correlation among features
- Basic operations such as dropping rows/columns, setting index, replacing values of a column using a dictionary, etc.
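
As a rough preview of several of the topics above, here is a minimal sketch using pandas. The CSV content, column names, and values are made up for illustration and are not taken from the workshop notebooks. Plotting is left out to keep the example short.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the sketch runs without any file on disk.
csv_text = """name,age,fare,port
"Braund, Mr. Owen",22,7.25,S
"Cumings, Mrs. John",38,71.28,C
"Heikkinen, Miss. Laina",,7.93,S
"Allen, Mr. William",35,8.05,Q
"""

# Converting CSV data to a dataframe.
df = pd.read_csv(io.StringIO(csv_text))

# Slicing with a conditional, then by label (loc) and by position (iloc).
expensive = df[df["fare"] > 8]
first_name = df.loc[0, "name"]
first_two_rows = df.iloc[:2]

# Statistical summary of the numeric columns.
summary = df.describe()

# Detecting missing values, then filling them (here with the column median).
n_missing = int(df["age"].isna().sum())
df["age"] = df["age"].fillna(df["age"].median())

# Regular expression that extracts a title such as "Mr" into a new feature.
df["title"] = df["name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)

# Replacing the values of a column using a dictionary.
df["port"] = df["port"].replace(
    {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
)

# Correlation between numeric features.
corr = df[["age", "fare"]].corr()

# Dropping a column and setting the index.
df = df.drop(columns=["fare"]).set_index("name")
```

Filling with the median is only one possible choice. The right strategy for missing values depends on the dataset.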

Structure for today's workshop:
- Introduction (20 min)
- Guided session (30 min)
- Hands-on exercise (45 min)
- Project work (20 min)

At the end, you will pick a dataset to work on in the subsequent sessions. It would be great to work on your dataset between sessions and afterwards, as and when you find time. You are also encouraged to explore multiple datasets instead of just one.

Let us start with the guided session. Please find the links below:
- Guided session 1
- Exercise 1

We recommend using Kaggle Kernels, Kaggle's cloud computation platform, for the guided session and exercise. Please use the above links to fork the notebooks and work with them. If you want to use your own laptop, please download the material from the [GitHub repository for the workshop](https://github.com/AashitaK/ML-Workshops) and install Jupyter Notebook via the Anaconda distribution. Please ask any one of us for help if you run into trouble running the notebooks.

### Picking a dataset
If you are a complete beginner with limited time to spare, these two datasets are a great choice to learn from with minimal effort. You can pick both, one for classification and the other for regression.
* [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)
* [House Prices: Predict sales prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)  

If you are a beginner willing to dive in and be challenged, these are some good ones:
* [Predict Future Sales](https://www.kaggle.com/c/competitive-data-science-predict-future-sales)
* [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand/data)

These datasets are great if you can bear the initial frustration of finding your way and are willing to put in time and effort:
* [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration/data)
* [New York City Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction)
* [Reducing Commercial Aviation Fatalities](https://www.kaggle.com/c/reducing-commercial-aviation-fatalities/data)
* [Instacart Market Basket Analysis](https://www.kaggle.com/c/instacart-market-basket-analysis/data)
* [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge/data)

The list is not exhaustive. [Kaggle Competitions](https://www.kaggle.com/competitions) (past and current) as well as [Kaggle Datasets](https://www.kaggle.com/datasets) are excellent resources for finding datasets. Below are a few suggestions to keep in mind if you want to pick a dataset from outside the above list:
* Data must be in tabular format (csv files). We will cover other data formats in the Deep Learning workshop series later on. Please don’t pick tabular data stored in Google BigQuery format as it is usually too big to work with.
* Prefer past competitions that have high participation, a lot of shared kernels and many topics in the discussion forum to learn from.
* Main features must not be textual unless you already know or plan to learn some natural language processing concepts within the time frame.
* Data size must be manageable for the computing power you have access to. This is especially important to check for recent competitions. You can also run notebooks in [Google Colab](https://research.google.com/colaboratory/faq.html) for free.
* It is better to pick a [competition dataset](https://www.kaggle.com/competitions) than one from the [dataset archive](https://www.kaggle.com/datasets) unless you have an idea about how to formulate the problem and choose evaluation metrics. If you formulate the problem yourself, check with a quick baseline model that the performance is appreciable with respect to the chosen metrics.
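
One way to read the baseline check in the last suggestion is to compare a quick first model against a trivial predictor that always outputs the most common class. The sketch below assumes a classification problem scored by accuracy. The labels and the quick model's predictions are hypothetical placeholders, and the helper names are illustrative only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class_predictions(train_labels, n):
    """Trivial predictor: always output the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n


# Hypothetical labels; in practice these come from a train/validation split.
train_labels = [0, 0, 0, 1, 1, 0, 1, 0]
test_labels = [0, 1, 0, 1, 0]

# Hypothetical predictions from a quick first model on the test rows.
quick_model_preds = [0, 1, 0, 0, 0]

trivial_acc = accuracy(
    test_labels, majority_class_predictions(train_labels, len(test_labels))
)
quick_acc = accuracy(test_labels, quick_model_preds)

# If the quick model barely beats the trivial predictor, the features may
# carry little signal for the problem as formulated.
beats_trivial = quick_acc > trivial_acc
print(f"trivial: {trivial_acc:.2f}, quick model: {quick_acc:.2f}")
```

For a regression problem, the analogous trivial predictor would always output the mean of the training target.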

