# Fall 2020: DS-GA 1011 NLP with Representation Learning
## Lab 1: 04-Sep-2020, Friday
## Introduction

### Pre-requisites
1. Python 3.7+
2. Virtual Environment
3. Jupyter Notebook/Lab

### Resources
1. [Python tutorial](https://docs.python.org/3.7/tutorial/)
2. [Python documentation](https://docs.python.org/3.7/)
3. [Python Introduction through CogSci exercise](https://colab.research.google.com/drive/1ghPQaTEdO9UH4s3gGD5OXmkYNvIwm2Zi)

### Environments
- [Local: Anaconda](https://docs.anaconda.com/anaconda/install/)
- [HPC Cluster](https://www.nyu.edu/life/information-technology/research-and-data-support/high-performance-computing.html)
- Cloud Platforms

---
### Getting Started (Local)

`conda env list`

`conda activate <env_name>`

`conda install <package_name>` or `pip install <package_name>`
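
Once an environment is activated, a quick check from Python can confirm which interpreter is running and whether it meets the Python 3.7+ prerequisite. This is a minimal sketch using only the standard library; the variable name `meets_requirement` is illustrative.

```python
import sys

# Path of the interpreter running this code; inside an activated conda
# environment it should point into that environment's directory.
print(sys.executable)

# The prerequisites call for Python 3.7 or newer.
meets_requirement = sys.version_info >= (3, 7)
print(meets_requirement)
```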

---
### [Jupyter Notebook/Lab](https://jupyter.org)
An open-source web application for creating documents that contain live code, equations, visualizations, and narrative text. JupyterLab is an enhanced web-based interactive development interface.

#### Cell Types

- Markdown: Adds text, images, and LaTeX equations to the notebook. Also supports text formatting, embedded code, tables, etc. [Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). **Help --> Markdown/Markdown Reference**
    

- Code: As the name suggests, this is where code is written in a supported language to produce text or image output. Code cells can also run shell commands (prefixed with `!`) and [magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html) (prefixed with `%` for line magics or `%%` for cell magics).
    

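- Raw: Content that is not executed by the kernel and is passed through unmodified when the notebook is converted, which allows embedding other formats. More information [here](https://nbsphinx.readthedocs.io/en/0.7.1/raw-cells.html).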

#### Help
- Keyboard Shortcuts
- Reference Documentation

#### Kernel
- Install: `python -m ipykernel install --user --name=<env_name>`
- List: `jupyter kernelspec list`
- Uninstall: `jupyter kernelspec uninstall <env_name>`

In [1]:
# Shell command
!pwd

/Users/jk/GitHub/NYU-CDS/2020.1011


In [2]:
!which python

/opt/anaconda3/envs/1011/bin/python


In [3]:
# List magics
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%

In [4]:
# Code
a = 2
b = 3
c = a+b
print(a, '+', b, '=', c)

2 + 3 = 5


In [5]:
# Help
?c

Type:        int
String form: 5
Docstring:
int([x]) -> integer
int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments
are given.  If x is a number, return x.__int__().  For floating point
numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in the
given base.  The literal can be preceded by '+' or '-' and be surrounded
by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
Base 0 means to interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4


---
### [Numpy](https://numpy.org/doc/stable/reference/?v=20200903223413)
A scientific computing package that provides array features such as indexing, vectorization, and broadcasting, and performs fast mathematical operations through pre-compiled C code.

In [6]:
import numpy as np

In [7]:
x = [1,2,3]

In [8]:
type(x)

list

In [9]:
np.array(x)

array([1, 2, 3])

In [10]:
# Array creation functions
print('Zeros\n', np.zeros((3, 3)))
print('Ones\n', np.ones((3, 3)))
print('Identity\n', np.identity(3))
print('Diagonal\n', np.diag(np.array([1, 2, 3]))) # diagonal matrix
print('Range\n', np.arange(9).reshape(3,3))
print('Random\n', np.random.rand(3, 3))

Zeros
 [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
Ones
 [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Identity
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Diagonal
 [[1 0 0]
 [0 2 0]
 [0 0 3]]
Range
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
Random
 [[0.60375274 0.85615272 0.16575488]
 [0.8586009  0.2793782  0.88931601]
 [0.16590221 0.56726676 0.29801778]]


In [11]:
arr = np.arange(9).reshape(3,3)
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [12]:
# Slicing
arr[:,2]

array([2, 5, 8])

In [13]:
# Broadcasting
arr + 5

array([[ 5,  6,  7],
       [ 8,  9, 10],
       [11, 12, 13]])
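
Broadcasting also applies between arrays of different shapes, not only between an array and a scalar. The sketch below rebuilds the same 3x3 `arr`; the names `row_offsets` and `col_offsets` are illustrative and not part of the lab.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

# A 1-D array of length 3 is stretched across every row of the 3x3 array.
row_offsets = np.array([10, 20, 30])
by_row = arr + row_offsets
print(by_row)
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]

# A (3, 1) column is stretched across every column instead.
col_offsets = np.array([[10], [20], [30]])
by_col = arr + col_offsets
print(by_col)
# [[10 11 12]
#  [23 24 25]
#  [36 37 38]]
```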

In [14]:
# Vectorization
def myfunc(a, b):
    "Return a-b if a>b, otherwise return a+b"
    if a > b:
        return a - b
    else:
        return a + b

In [15]:
vfunc = np.vectorize(myfunc)
vfunc([1, 2, 3, 4], 2)

array([3, 4, 1, 2])
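
Note that `np.vectorize` is mainly a convenience: according to the NumPy documentation, its implementation is essentially a Python for loop. When the logic can be written with array operations, `np.where` gives the same result without a per-element function call. A minimal sketch of the same rule as `myfunc`:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = 2

# Elementwise: a - b where a > b, otherwise a + b.
result = np.where(a > b, a - b, a + b)
print(result)   # [3 4 1 2]
```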

---
### [Pandas](https://pandas.pydata.org/pandas-docs/stable/?v=20200903223413)

Data analysis library that provides easy-to-use data structures for the Python programming language.

In [16]:
import pandas as pd

In [17]:
index = ['a','b','c','d','e']
series = pd.Series(np.arange(5), index=index) 
print(series)

a    0
b    1
c    2
d    3
e    4
dtype: int64


#### Accessing rows

In [18]:
series[['a', 'c']]

a    0
c    2
dtype: int64

In [19]:
# Elements can also be accessed by integer position
series.iloc[2]

2

In [20]:
#Slicing
series['b':'e']

b    1
c    2
d    3
e    4
dtype: int64
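
Label-based slices include both endpoints, which is why `'e'` appears in the output above. Position-based slices follow the usual Python rule and exclude the stop position. A short sketch contrasting the two on the same Series (the names `by_label` and `by_position` are illustrative):

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# Label slice: the stop label 'd' is included.
by_label = series.loc['b':'d']
print(by_label.tolist())      # [1, 2, 3]

# Position slice: the stop position 3 is excluded.
by_position = series.iloc[1:3]
print(by_position.tolist())   # [1, 2]
```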

In [21]:
# Filtering with a boolean mask (named `mask` to avoid shadowing the built-in `filter`)
mask = series >= 3
series[mask]

d    3
e    4
dtype: int64
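
Several conditions can be combined into one boolean mask with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on the same Series; the name `middle` is illustrative:

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# Keep values that are at least 1 and strictly less than 4.
middle = series[(series >= 1) & (series < 4)]
print(middle.tolist())       # [1, 2, 3]
print(list(middle.index))    # ['b', 'c', 'd']
```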

#### Reindexing

In [22]:
s = pd.Series(['blue', 'purple', 'red'], index=[0, 2, 4])
print(s)
s.reindex([2, 3, 1, 0])  # labels not present in s (3 and 1) become NaN

0      blue
2    purple
4       red
dtype: object


2    purple
3       NaN
1       NaN
0      blue
dtype: object

In [23]:
#s.reindex(range(5), fill_value='black')
s.reindex(range(5), method='ffill')

0      blue
1      blue
2    purple
3    purple
4       red
dtype: object
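
The commented-out alternative in the cell above uses `fill_value` instead of forward filling: every label missing from the original index receives the same constant. A short sketch of that variant (the name `filled` is illustrative):

```python
import pandas as pd

s = pd.Series(['blue', 'purple', 'red'], index=[0, 2, 4])

# Labels 1 and 3 are missing from s, so they receive the constant 'black'.
filled = s.reindex(range(5), fill_value='black')
print(filled.tolist())   # ['blue', 'black', 'purple', 'black', 'red']
```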

In [24]:
index1 = ['a','b','c']
index2 = ['a','b','d','e']
#sdata = {'b':100, 'c':150}
s1 = pd.Series([1,2,3],index =index1 )
s2 = pd.Series([3,4,5,7], index = index2)

In [25]:
# Arithmetic aligns on index labels; a label missing from either Series gives NaN
s1 + s2
# s1*2

a    4.0
b    6.0
c    NaN
d    NaN
e    NaN
dtype: float64
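
If the NaN results are not wanted, the method form `Series.add` accepts a `fill_value` that stands in for a label missing from one of the operands. A minimal sketch with the same two Series (the name `total` is illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([3, 4, 5, 7], index=['a', 'b', 'd', 'e'])

# A label missing from one Series is treated as 0 before adding.
total = s1.add(s2, fill_value=0)
print(total.tolist())        # [4.0, 6.0, 3.0, 5.0, 7.0]
print(list(total.index))     # ['a', 'b', 'c', 'd', 'e']
```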

#### Ranking elements

In [26]:
s = pd.Series([7, -5, 7, 4, 2, 0, 4])
s.rank()  # default method='average': tied values share the mean of their ranks

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [27]:
s.rank(method='first')  # ties are ranked in order of appearance

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
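
Two other tie-breaking options are `'min'`, which gives every tied value the lowest rank in its group, and `'dense'`, which does the same but leaves no gaps between groups. A short sketch on the same data (the names `min_rank` and `dense_rank` are illustrative):

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': tied values all take the lowest rank of their group.
min_rank = s.rank(method='min')
print(min_rank.tolist())     # [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]

# 'dense': like 'min', but ranks increase by 1 between groups.
dense_rank = s.rank(method='dense')
print(dense_rank.tolist())   # [5.0, 1.0, 5.0, 4.0, 3.0, 2.0, 4.0]
```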

#### Creating a dataframe

In [28]:
data = pd.DataFrame({ 'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

In [29]:
# print() gives plain text; a DataFrame as the last expression in a cell is rendered as a table
print(data)
data

   children   pet  salary
0       4.0   cat      90
1       6.0   dog      24
2       3.0   dog      44
3       3.0  fish      27
4       2.0   cat      32
5       3.0   dog      59
6       5.0   cat      36
7       4.0  fish      27


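|   | children | pet | salary |
|---|----------|-----|--------|
| 0 | 4.0 | cat | 90 |
| 1 | 6.0 | dog | 24 |
| 2 | 3.0 | dog | 44 |
| 3 | 3.0 | fish | 27 |
| 4 | 2.0 | cat | 32 |
| 5 | 3.0 | dog | 59 |
| 6 | 5.0 | cat | 36 |
| 7 | 4.0 | fish | 27 |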


In [30]:
print(data.salary)
print(data[['pet','salary']])

0    90
1    24
2    44
3    27
4    32
5    59
6    36
7    27
Name: salary, dtype: int64
    pet  salary
0   cat      90
1   dog      24
2   dog      44
3  fish      27
4   cat      32
5   dog      59
6   cat      36
7  fish      27


#### Creating and removing columns

In [31]:
# Assigning to a new column name adds that column
data['Other Pets'] = np.nan

In [32]:
data

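|   | children | pet | salary | Other Pets |
|---|----------|-----|--------|------------|
| 0 | 4.0 | cat | 90 | NaN |
| 1 | 6.0 | dog | 24 | NaN |
| 2 | 3.0 | dog | 44 | NaN |
| 3 | 3.0 | fish | 27 | NaN |
| 4 | 2.0 | cat | 32 | NaN |
| 5 | 3.0 | dog | 59 | NaN |
| 6 | 5.0 | cat | 36 | NaN |
| 7 | 4.0 | fish | 27 | NaN |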


In [33]:
data.drop(['Other Pets'], axis=1)  # returns a new DataFrame; data itself is unchanged

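|   | children | pet | salary |
|---|----------|-----|--------|
| 0 | 4.0 | cat | 90 |
| 1 | 6.0 | dog | 24 |
| 2 | 3.0 | dog | 44 |
| 3 | 3.0 | fish | 27 |
| 4 | 2.0 | cat | 32 |
| 5 | 3.0 | dog | 59 |
| 6 | 5.0 | cat | 36 |
| 7 | 4.0 | fish | 27 |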
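
Because `drop` returns a new DataFrame, the column is still present in `data` afterwards unless the result is assigned back. A minimal sketch on a smaller, made-up frame (the two-row data and the name `dropped` are illustrative), using the equivalent `columns=` keyword:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'pet': ['cat', 'dog'], 'salary': [90, 24]})
data['Other Pets'] = np.nan

# drop() leaves the original untouched and returns a modified copy.
dropped = data.drop(columns=['Other Pets'])
print(list(dropped.columns))   # ['pet', 'salary']
print(list(data.columns))      # ['pet', 'salary', 'Other Pets']

# Assign the result back to make the removal stick.
data = data.drop(columns=['Other Pets'])
print(list(data.columns))      # ['pet', 'salary']
```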


#### Accessing elements with _iloc_ and _loc_

In [34]:
print(data.iloc[3:5,0:2])
print(data.loc[data['salary']>50])

   children   pet
3       3.0  fish
4       2.0   cat
   children  pet  salary  Other Pets
0       4.0  cat      90         NaN
5       3.0  dog      59         NaN
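
`loc` can select rows and columns in one step, and its label slices include the stop label, whereas `iloc` slices exclude the stop position. The sketch below rebuilds the original three-column frame; the names `subset`, `by_label`, and `by_pos` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'salary': [90, 24, 44, 27, 32, 59, 36, 27]})

# Boolean row filter plus a list of column labels.
subset = data.loc[data['salary'] > 50, ['pet', 'salary']]
print(list(subset.index))     # [0, 5]

# Label slices: row label 5 and column 'pet' are both included.
by_label = data.loc[3:5, 'children':'pet']
print(list(by_label.index))   # [3, 4, 5]

# Position slices: row position 5 and column position 2 are excluded.
by_pos = data.iloc[3:5, 0:2]
print(list(by_pos.index))     # [3, 4]
```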


---
# References
DS-GA 1007 Programming for Data Science