# Lab 1 - Introduction to Python

## Agenda

### Environment Setup (Jupyter Notebook)
   * Anaconda
   * Google Colab

### Git & GitHub
   * Create a repository
   * Basic commands
### Introduction to Jupyter Notebook
   * Markdown
   * Code
   * Magic Commands
   * Keyboard Shortcuts
### Install packages
   * pip install
### Python Basics
* String
* Regex
* Lists
* Map & Filter
* Files


### Python Libraries
* Numpy
* Matplotlib
* Pandas
* Seaborn

### EDA (Exploratory Data Analysis)
* Data Cleaning
    * Handling Missing Values
    * Drop unwanted columns
* Data Visualization
    * Types of plots (bar, line, scatter, box, histogram, etc.)
    * Multivariate analysis
    * Correlation
    * Univariate analysis
* Data Preprocessing
    * Encoding
    * Scaling
    * Normalization
    * Standardization
* Data Summarization
    * Statistical summary (mean, median, mode, std, min, max)

### NLP Introduction (continued in Lab 2)
* Tokenization
* Stemming
* Punctuation
* Lemmatization
* Stopwords
* POS Tagging
* Named Entity Recognition
* Sentiment Analysis
* Text Classification


#### Author : Ayman Elsayeed
#### Date : 2021-06-01

In [None]:
!pip install nltk
!pip install spacy
!pip install textblob
!pip install gensim
!pip install wordcloud
!pip install matplotlib
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install seaborn

In [80]:
# Import Libraries
import nltk
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
nltk.download('stopwords')

### String

In [2]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

In [3]:
# Return a list of the words

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>text1.split()</code>
    </p>
</details>


In [4]:
# Words longer than 3 letters, as a list

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[w for w in text1.split() if len(w) > 3]</code>
    </p>
</details>

In [5]:
# Capitalized words in text1

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[w for w in text1.split() if w.istitle()]</code>
    </p>
</details>

In [6]:
# Words in text1 that end in 's'

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[w for w in text1.split() if w.endswith('s')]</code>
    </p>
</details>

In [7]:
# Convert text1 to lowercase

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>text1.lower()</code>
    </p>
</details>

In [8]:
text2 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

In [9]:
# find hashtags in text2

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[w for w in text2.split() if w.startswith('#')]</code>
    </p>
</details>

In [10]:
# find callouts in text2
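
One possible answer, assuming a callout is any word that begins with `@`. The string is repeated here so the block runs on its own.

```python
text2 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

# Keep every whitespace-separated word that starts with '@'.
callouts = [w for w in text2.split() if w.startswith('@')]
print(callouts)  # ['@']
```

The only hit is a bare `@`, because text2 contains no real mention. A plain `startswith` check cannot tell the two apart.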

In [12]:
text3 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

In [1]:
# find callouts in text3

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[w for w in text3.split() if w.startswith('@')]</code>
    </p>
</details>


### Regex

<br>

We can use regular expressions to help us with more complex parsing. 

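For example, `'@[A-Za-z0-9_]+'` matches `'@'` followed by one or more characters, each of which is:
* a capital letter (`'A-Z'`),
* a lowercase letter (`'a-z'`),
* a digit (`'0-9'`),
* or an underscore (`'_'`).

Because the cell below uses `re.search`, a word is kept if the pattern matches anywhere in it, not only at its start.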

In [14]:
import re
[w for w in text3.split() if re.search('@[A-Za-z0-9_]+', w)]
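
The same pattern can also be run over the whole string with `re.findall`, which skips the `split()` step. This is a minimal sketch rather than the lab's official answer; `text3` is repeated so the block is self-contained.

```python
import re

text3 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# A raw string (r'...') keeps backslashes literal, a useful habit for patterns.
pattern = r'@[A-Za-z0-9_]+'

# re.findall scans the whole string and returns every non-overlapping match.
callouts = re.findall(pattern, text3)
print(callouts)  # ['@UN', '@UN_Women']

# \w covers [A-Za-z0-9_] (plus other Unicode word characters),
# so on this text it gives the same result.
print(re.findall(r'@\w+', text3))  # ['@UN', '@UN_Women']
```

The bare `@` before "NY" is skipped because the pattern requires at least one character after it.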

### Lists

In [15]:
# list of words in text1

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>text1.split()</code>
    </p>
</details>

In [16]:
# filter list

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[w for w in text1.split() if len(w) > 3]</code>
    </p>
</details>

In [17]:
# apply function to each element in the list

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>[len(w) for w in text1.split()]</code>
    </p>
</details>


In [18]:
# convert list of words to single string

<details>
    <summary>Click to reveal answer</summary>
    <p>
        <code>' '.join(text1.split())</code>
    </p>
</details>
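
The agenda also lists Map & Filter. As a hedged sketch, the two list comprehensions above can be written with the built-in `map` and `filter` functions; the variable names are illustrative only.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
words = text1.split()

# filter keeps the items for which the function returns True.
long_words = list(filter(lambda w: len(w) > 3, words))

# map applies the function to every item.
lengths = list(map(len, words))

print(long_words)
print(lengths)
```

Both functions return lazy iterators in Python 3, which is why the results are wrapped in `list()`.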

### Files

In [19]:
# open a file

In [20]:
# read single line

In [21]:
# read all lines

In [22]:
# write to a file

In [23]:
# append to a file

In [25]:
# close the file
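
A possible walk-through of the file operations listed above. The file name `example.txt` and its contents are made up for illustration, and the file is written first so that there is something to read.

```python
import os
import tempfile

# Hypothetical file in a temporary directory, so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# write to a file ('w' creates the file or overwrites it)
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\n")
    f.write("second line\n")

# append to a file ('a' adds to the end)
with open(path, "a", encoding="utf-8") as f:
    f.write("third line\n")

# open a file for reading
f = open(path, "r", encoding="utf-8")
# read single line
first = f.readline()
# read all (remaining) lines
rest = f.readlines()
# close the file
f.close()

print(repr(first))
print(rest)
print(f.closed)
```

The `with` form closes the file automatically, even if an error occurs; the explicit `open`/`close` pair is shown only to mirror the cells above.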

### Numpy

In [82]:
# Numpy vs List

In [27]:
# create a numpy array

In [28]:
# slicing

In [29]:
# shape

In [30]:
# reshape

In [31]:
# math operations

In [81]:
# vectorized operations
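
The NumPy cells above could be filled in along these lines. This is a sketch with arbitrary values, not the lab's reference solution.

```python
import numpy as np

# Numpy vs list: * repeats a list but multiplies an array element-wise
numbers = [1, 2, 3]
print(numbers * 2)            # [1, 2, 3, 1, 2, 3]
print(np.array(numbers) * 2)  # [2 4 6]

# create a numpy array
a = np.arange(12)  # 0, 1, ..., 11

# slicing
print(a[2:5])  # [2 3 4]

# shape
print(a.shape)  # (12,)

# reshape
m = a.reshape(3, 4)
print(m.shape)  # (3, 4)

# math operations
print(m.sum(), m.mean(), m.max())  # 66 5.5 11

# vectorized operations: no explicit Python loop
squares = a ** 2
print(squares[:4])  # [0 1 4 9]
```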

### Pandas

In [33]:
# create a dataframe

In [34]:
# read a csv file

In [35]:
# head

In [36]:
# tail

In [37]:
# info

In [38]:
# describe

In [39]:
# shape

In [40]:
# columns

In [41]:
# indexing

In [42]:
# filtering

In [43]:
# sorting
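
A sketch of the inspection, indexing, filtering and sorting cells above. The table values are invented, and `io.StringIO` stands in for a real path such as a hypothetical `data.csv`.

```python
import io
import pandas as pd

# create a dataframe (made-up values for illustration)
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "age": [28, 35, 41, 23],
    "city": ["Cairo", "Giza", "Cairo", "Luxor"],
})

# read a csv file: StringIO behaves like an open file
csv_text = "name,age,city\nAnn,28,Cairo\nBob,35,Giza\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # first rows
print(df.tail(2))        # last rows
df.info()                # column types and non-null counts
print(df.describe())     # summary statistics of numeric columns
print(df.shape)          # (4, 3)
print(list(df.columns))  # ['name', 'age', 'city']

# indexing: by label with .loc, by position with .iloc
print(df.loc[0, "name"])  # Ann
print(df.iloc[1, 1])      # 35

# filtering with a boolean mask
over_30 = df[df["age"] > 30]

# sorting
by_age = df.sort_values("age", ascending=False)
print(by_age["name"].tolist())  # ['Cara', 'Bob', 'Ann', 'Dan']
```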

In [45]:
# missing values

In [46]:
# drop columns

In [47]:
# drop rows

In [48]:
# fill missing values

In [49]:
# unique values

In [50]:
# value counts

In [51]:
# apply function

In [52]:
# apply-map function
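
The cleaning and transformation cells above might look like the following, again on invented data. `DataFrame.applymap` is deprecated in recent pandas releases in favour of `DataFrame.map`, so this sketch applies `Series.map` to each column instead, which should work on both older and newer versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "age": [28.0, np.nan, 41.0, 23.0],
    "city": ["Cairo", "Giza", "Cairo", None],
})

# missing values: count per column
print(df.isna().sum())

# drop columns
no_city = df.drop(columns=["city"])

# drop rows that contain any missing value
complete = df.dropna()

# fill missing values (here with the column mean)
filled = df.fillna({"age": df["age"].mean()})

# unique values and value counts
print(df["city"].unique())
print(df["city"].value_counts())

# apply a function to each value of a column
name_lengths = df["name"].apply(len)

# element-wise function over a DataFrame, one column at a time
upper = df[["name"]].apply(lambda col: col.map(str.upper))
print(upper)
```

None of these calls change `df` in place; each returns a new object, which is why the results are assigned to new names.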

### Seaborn

In [53]:
import seaborn as sns

In [54]:
# load dataset

In [55]:
# histogram

In [56]:
# scatter plot

In [57]:
# box plot

In [58]:
# pair plot

In [59]:
# correlation

In [60]:
# heatmap

In [61]:
# bar plot

In [62]:
# line plot

In [63]:
# violin plot

In [64]:
# joint plot

In [65]:
# cat plot

In [66]:
# lm plot

In [67]:
# kde plot

### EDA (Exploratory Data Analysis)

In [68]:
# load dataset

In [69]:
# head

In [70]:
# info

In [71]:
# describe

In [72]:
# shape

In [73]:
# columns

In [74]:
# check missing values

In [78]:
# remove missing values

In [75]:
# check distribution of variables

In [79]:
# drop unwanted columns

In [76]:
# check correlation

In [77]:
# check outliers
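
The EDA steps above, sketched on a small invented table. The column names and values are assumptions, and the 1.5 × IQR rule is only one common way to flag outliers, not necessarily the one intended in the lab.

```python
import numpy as np
import pandas as pd

# Made-up dataset standing in for the lab's data file
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "height": [160.0, 172.0, np.nan, 181.0, 168.0, 175.0],
    "weight": [55.0, 70.0, 64.0, 82.0, 61.0, 250.0],
})

# check missing values
print(df.isna().sum())

# remove missing values
clean = df.dropna()

# drop unwanted columns
clean = clean.drop(columns=["id"])

# check distribution of variables
print(clean.describe())

# check correlation
print(clean.corr())

# check outliers with the 1.5 * IQR rule
q1 = clean["weight"].quantile(0.25)
q3 = clean["weight"].quantile(0.75)
iqr = q3 - q1
is_outlier = (clean["weight"] < q1 - 1.5 * iqr) | (clean["weight"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]
print(outliers)
```

Here the quartiles of `weight` are 61 and 82, so the upper fence is 113.5 and the 250 entry is the only value flagged.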