In this notebook, we are going to review some of the basics in Python that you partially covered in your introductory classes. We will get started on some very easy exercises and progressively see more complicated topics.

The goal of this short tutorial is to reinforce important basics, and for you to see whether you'll be comfortable using Python throughout the class - if there are any issues, we want to make sure to fix them early on, to get the best learning experience.

Please fill in the document as you go!

## Welcome to our first Jupyter notebook!

This is a markdown cell, where you can read text written in a "normal" font. These cells will be interspersed with code cells that you can run.

As you may know, Python is open source code. This means that anyone is free to contribute to writing bits of it and anyone can use it free of charge. A collection of pieces of code is called a package: we install packages so we don't have to rewrite things from scratch every time. We can leverage what other people have already done!

As a consequence, the first commands you will see here deal with package installation. These packages will be installed using "conda install name_of_package". You may also come across the code "pip install name_of_package": these are similar commands. As we are working with Anaconda, the first one is easier to use. But if for some reason, conda install fails to work for you, switch to pip (also, many packages are not available with conda, but only with pip)!

The way it works for most coding languages is that packages only have to be installed *once*. This is similar to downloading an app e.g. on your phone. However, packages generally have to be loaded or imported at the beginning of every session (i.e., every time you re-open anaconda). This is similar to opening the app on your phone. It means that the package is active in the background. 

As a consequence, I am including some "installation commands" at the beginning of this notebook. In the future, you won't need to run those code segments again: you will just have to load the packages. However, it doesn't really matter if you forget and run them again: Anaconda will check for you whether the packages are already there and, if they are, will simply not reinstall them.

Note that we are downloading an awful lot of packages here. This is because these packages will come in useful at many points in this course, so we may as well install them from the get-go. However, you only need `numpy` and `pandas` for the rest of this notebook, so feel free to skip the others during class if you run into trouble, and try installing them after class.
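If you are unsure whether a package is already installed, one way to check without reinstalling anything is sketched below. The helper name `is_installed` is an assumption made for this illustration; `importlib.util.find_spec` is part of the Python standard library.

```python
import importlib.util


def is_installed(import_name):
    """Return True if a package with this import name can be found."""
    # find_spec returns None when no top-level package of that name is available
    return importlib.util.find_spec(import_name) is not None


# Note: the import name can differ from the install name
# (for example, scikit-learn is imported as sklearn).
for name in ["numpy", "pandas"]:
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```

If a package shows up as missing, run the corresponding installation cell and restart the kernel.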

In [4]:
pip install numpy

Collecting numpy
  Downloading numpy-1.23.3-cp310-cp310-macosx_11_0_arm64.whl (13.3 MB)
Installing collected packages: numpy
Successfully installed numpy-1.23.3
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install scipy

Collecting scipy
  Downloading scipy-1.9.1-cp310-cp310-macosx_12_0_arm64.whl (29.9 MB)
Installing collected packages: scipy
Successfully installed scipy-1.9.1
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install pandas

In [10]:
pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-3.6.0-cp310-cp310-macosx_11_0_arm64.whl (7.2 MB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.2.0-cp310-cp310-macosx_11_0_arm64.whl (2.8 MB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.37.4-py3-none-any.whl (960 kB)
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting contourpy>=1.0.1
  Downloading contourpy-1.0.5-cp310-cp310-macosx_11_0_arm64.whl (226 kB)

In [11]:
pip install statsmodels

Collecting statsmodels
  Downloading statsmodels-0.13.2-cp310-cp310-macosx_11_0_arm64.whl (9.1 MB)
Collecting pandas>=0.25
  Downloading pandas-1.5.0-cp310-cp310-macosx_11_0_arm64.whl (10.8 MB)
Collecting patsy>=0.5.2
  Downloading patsy-0.5.2-py2.py3-none-any.whl (233 kB)
Collecting pytz>=2020.1
  Downloading pytz-2022.4-py2.py3-none-any.whl (500 kB)
Installing collected packages: pytz, patsy, pandas, statsmodels
Successfully installed p

In [12]:
pip install seaborn

Collecting seaborn
  Downloading seaborn-0.12.0-py3-none-any.whl (285 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.12.0
Note: you may need to restart the kernel to use updated packages.


In [13]:
pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.1.2-cp310-cp310-macosx_12_0_arm64.whl (7.7 MB)
Collecting joblib>=1.0.0
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.2.0 scikit-learn-1.1.2 threadpoolctl-3.1.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install graphviz

Note: you may need to restart the kernel to use updated packages.


In [51]:
conda install python-graphviz

Collecting package metadata (current_repodata.json): done
Solving environment: / ^C
failed with initial frozen solve. Retrying with flexible solve.

CondaError: KeyboardInterrupt


Note: you may need to restart the kernel to use updated packages.


In [15]:
pip install pygal_maps_world 

Collecting pygal_maps_world
  Using cached pygal_maps_world-1.0.2.tar.gz (270 kB)
  Preparing metadata (setup.py) ... done
Collecting pygal>=1.9.9
  Using cached pygal-3.0.0-py2.py3-none-any.whl (129 kB)
Using legacy 'setup.py install' for pygal_maps_world, since package 'wheel' is not installed.
Installing collected packages: pygal, pygal_maps_world
  Running setup.py install for pygal_maps_world ... done
Successfully installed pygal-3.0.0 pygal_maps_world-1.0.2
Note: you may need to restart the kernel to use updated packages.


In [16]:
pip install tweepy

Collecting tweepy
  Using cached tweepy-4.10.1-py3-none-any.whl (94 kB)
Collecting requests<3,>=2.27.0
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting oauthlib<4,>=3.2.0
  Using cached oauthlib-3.2.1-py3-none-any.whl (151 kB)
Collecting requests-oauthlib<2,>=1.2.0
  Using cached requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting charset-normalizer<3,>=2
  Using cached charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2022.9.24-py3-none-any.whl (161 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
Installing collected packages: urllib3, oauthlib, idna, charset-normalizer, certifi, requests, requests-oauthlib, tweepy
Successfully installed certifi-2022.9.24 charset-normalizer-2.1.1 idna-3.4 oauthlib-3.2.1 requests-2.28.1 requests-oauthlib-1.3.1 tweepy-4.10.1 urllib3-1.26.12
Note: you 

In [17]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [18]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [19]:
pip install splash

Collecting splash
  Using cached splash-3.5-py3-none-any.whl (213 kB)
Collecting qt5reactor
  Using cached qt5reactor-0.6.3-py3-none-any.whl (9.5 kB)
Collecting funcparserlib
  Using cached funcparserlib-1.0.0-py2.py3-none-any.whl (17 kB)
Collecting adblockparser
  Using cached adblockparser-0.7-py2.py3-none-any.whl (13 kB)
Collecting xvfbwrapper
  Using cached xvfbwrapper-0.2.9.tar.gz (5.6 kB)
  Preparing metadata (setup.py) ... done
Collecting Twisted[http2]>=19.7.0
  Downloading Twisted-22.8.0-py3-none-any.whl (3.1 MB)
Collecting typing-extensions>=3.6.5
  Downloading typing_extensions-4.3.0-py3-none-any.whl (25 kB)
Collecting incremental>=21.3.0
  Downloading incremental-21.3.0-py2.py3-none-any.whl (15 kB)
Collecting Automat>=0.8.0
  Downloading Automat-20.2.0-py2.py3-none-any.whl (31 kB)
Collecting zope.interface>=4.4.2
  Downloading zope.in

In [20]:
pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.6.3-py2.py3-none-any.whl (264 kB)
Collecting itemadapter>=0.1.0
  Downloading itemadapter-0.7.0-py3-none-any.whl (10 kB)
Collecting protego>=0.1.15
  Downloading Protego-0.2.1-py2.py3-none-any.whl (8.2 kB)
Collecting pyOpenSSL>=21.0.0
  Downloading pyOpenSSL-22.1.0-py3-none-any.whl (57 kB)
Collecting service-identity>=18.1.0
  Downloading service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting PyDispatcher>=2.0.5
  Downloading PyDispatcher-2.0.6.tar.gz (38 kB)
  Preparing metadata (setup.py) ... done
Collecting w3lib>=1.17.0
  Downloading w3lib-2.0.1-py3-none-any.whl (20 kB)
Collecting queuelib>=1.4.2
  Downloading queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
C

In [21]:
pip install scrapy-splash

Collecting scrapy-splash
  Using cached scrapy_splash-0.8.0-py2.py3-none-any.whl (27 kB)
Installing collected packages: scrapy-splash
Successfully installed scrapy-splash-0.8.0
Note: you may need to restart the kernel to use updated packages.


In [22]:
pip install nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting regex>=2021.8.3
  Downloading regex-2022.9.13-cp310-cp310-macosx_11_0_arm64.whl (287 kB)
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-8.1.3 nltk-3.7 regex-2022.9.13 tqdm-4.64.

In [23]:
pip install praw

Collecting praw
  Using cached praw-7.6.0-py3-none-any.whl (188 kB)
Collecting websocket-client>=0.54.0
  Using cached websocket_client-1.4.1-py3-none-any.whl (55 kB)
Collecting update-checker>=0.18
  Using cached update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting prawcore<3,>=2.1
  Using cached prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.6.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.4.1
Note: you may need to restart the kernel to use updated packages.


In [24]:
pip install pytrends

Collecting pytrends
  Using cached pytrends-4.8.0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Using legacy 'setup.py install' for pytrends, since package 'wheel' is not installed.
Installing collected packages: pytrends
  Running setup.py install for pytrends ... done
Successfully installed pytrends-4.8.0
Note: you may need to restart the kernel to use updated packages.


In [25]:
pip install selenium

Collecting selenium
  Using cached selenium-4.5.0-py3-none-any.whl (995 kB)
Collecting trio~=0.17
  Using cached trio-0.22.0-py3-none-any.whl (384 kB)
Collecting trio-websocket~=0.9
  Using cached trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting outcome
  Using cached outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting sniffio
  Using cached sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting exceptiongroup>=1.0.0rc9
  Using cached exceptiongroup-1.0.0rc9-py3-none-any.whl (12 kB)
Collecting async-generator>=1.9
  Using cached async_generator-1.10-py3-none-any.whl (18 kB)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting wsproto>=0.14
  Using cached wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting PySocks!=1.5.7,<2.0,>=1.5.6
  Downloading PySocks-1.7.1-py3-none-any.whl (16 kB)
Collecting h11<1,>=0.9.0
  Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: sortedcontainers, sniffio, PySocks, outco

# 1. Preambles in a Jupyter Notebook

Out of the box, Python is fairly "basic". It contains, for example, commands to do basic arithmetic and basic data structures (such as lists). To complement the existing functionalities provided by Python, we install **packages**. For any package we wish to use in a notebook, we first need to import it, often renaming it in the process to make it shorter to type out.

You don't need to know what each package does exactly but it can be useful to have a high-level idea: the ones we use in this session are numpy (specializes in **arrays** which are ways of storing data) and pandas (specializes in **dataframes** which again are a way of storing data).

In [27]:
import numpy as np
import pandas as pd
import graphviz 


In general, in preambles of the notebooks I give you, there is also a part that reads in the data that we are going to use throughout the lecture. We will see how we "read in" data later on.

# 2. Print, variable assignment and types, basic operations, lists

All of the concepts we see here should be easy and immediate for you in the long-run (i.e., you should be so familiar with them that you don't need any googling or help using them). 

## 1. Printing

The "print" command enables us to "see" what is going on in the background. We use it in this way: `print("Hello World!")`. Try it out!

Note that you can get around using `print` by simply typing `"Hello World!"` (with the quotes) on the last line of a cell. Try it out!

This only works for the last expression in a cell, though. If you type `"Hello World!"` and then, in the same cell, `"My name is..."`, only the last one appears. Try it out! This is where `print` is particularly useful.
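As a minimal sketch of the difference (the variable name `greeting` is only for illustration), the cell below uses `print` to display two lines explicitly, while the bare expression at the end is only echoed by Jupyter because it is the last line of the cell:

```python
greeting = "Hello World!"

# print displays every value we pass to it, in order
print(greeting)
print("My name is...")

# A bare expression is displayed by Jupyter only when it is the last line of the cell
greeting
```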

## 2. Variable Assignment and Types

It is very useful to know how to store information into a variable. For example, if the price of a good is 28.75 euros and I don't want to keep remembering this number, I can simply write `p=28.75` and then call `p` throughout. This is known as *variable assignment*: we assign to the variable p the value 28.75. 

Try assigning the integer 104 to the variable `number`. Then print out `number`.

To organize things, Python classifies information into categories and then handles these categories differently. There are many categories but we just focus on three: integers (such as 104), floats (you can view these as decimal numbers), and strings (which are text based). For Python to recognize these categories, you don't have to do much except input your information correctly. You can use `type` to see what this gives you. For example:

In [1]:
number=104
type(number)

int

Now pick your favorite float and string, assign them to some variables and print out their type!
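One possible way to go about it is sketched below. The variable names `price` and `course` are arbitrary choices for this illustration; the sketch also shows `int()` and `str()`, two built-in functions that convert a value from one type to another.

```python
price = 28.75      # a float (decimal number)
course = "DTVC"    # a string (text)
number = 104       # an integer

print(type(price))
print(type(course))
print(type(number))

# Converting between types: int() truncates a float, str() turns a number into text
print(int(price))                  # 28
print(str(number) + " students")   # 104 students
```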

## 3. Basic operations

These are the basic arithmetic operations that you may have to conduct: they include `+` for addition, `-` for subtraction, `*` for multiplication, `/` for division and `**` for powers. 

Our first exercise is to take any temperature in Celsius and convert it to Fahrenheit. For example, $0^\circ C \cdot 9/5+32= 32^\circ F$. How much is $15^\circ$ Celsius in Fahrenheit?

Compute the area of a circle of radius 3. (Reminder: $A=\pi R^2$ and use `np.pi` for $\pi$.)

The `%` operation gives the remainder of a division: try `10%2` versus `11%2`.
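The operations above can be sketched in a single cell as follows. This is one possible solution rather than the only one; it uses `math.pi` from the standard library, which has the same value as `np.pi`, so that the cell runs even before numpy is imported.

```python
import math  # math.pi has the same value as np.pi

# Celsius to Fahrenheit
celsius = 15
fahrenheit = celsius * 9 / 5 + 32
print(fahrenheit)        # 59.0

# Area of a circle of radius 3
radius = 3
area = math.pi * radius ** 2
print(round(area, 2))    # 28.27

# Remainder of a division
print(10 % 2)            # 0: 10 is even
print(11 % 2)            # 1: 11 is odd
```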

## 4. Lists

These are the most basic data structures one can encounter in Python. Defining a list L can look something like this `L=[1,2,3]`. A list can take as an entry any type of variable (strings, floats, integers). It can also be taken to be empty if the goal is just to add to it later `L=[]`.

Create a list L that contains: your first name, your last name, your age. Print it out.

In [None]:
l = [1,2,3]

Check that your list has size 3 by running `len()` on it.

In [None]:
len(l)

Access the second element of your list using `L[1]`. Why is it not `L[2]`? How would you access the last two elements, for example?

In [None]:
l[1], l[2]

Note that `L[-1]` gives access to the last element.

In [None]:
l[-1]

Change the first element of the list to your second name, or a name you would have liked to have if you don't have a second name. Use the assignment operator (`=`) to do this.

In [None]:
l[0] = 'Junming'
l

Using `.append` add your month of birth (as a string) to your list L.

In [None]:
l.append('09')
l

Suppose I want to add two pieces of information to the list: the class name and number, i.e., I want to add `M=["DTVC", 1]` to the existing list L. Try using `.append()` and `.extend()` to see how they are different. Which one should you use?

In [None]:
L=["Philippe","Blaettchen",29]
M=["DTVC",1]
L.append(M)
L


In [None]:
L=["Philippe","Blaettchen",29]
M=["DTVC",1]
L.extend(M)
L

Note that `.append` and `.extend` modify the *original* list. If you want to create a new variable that contains L with M added on you would use the `+` operator thus:

In [None]:
L=["Philippe","Blaettchen",29]
M=["DTVC",1]
NewList=L+M
print(NewList)

One last list method that might be useful is `.count`, which counts how often a given value occurs in a list. Use `.count` to count the number of occurrences of 5 in the list below.

In [None]:
L=[1,2,4,2,4,5,1,5,0]

Similar methods that can be used are `.sort`, which sorts the list in place, `.insert`, which inserts an element at a specified index, and `.index`, which returns the index of the first appearance of a given item.
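Here is a small sketch of these list methods on a made-up list (the name `scores` and its values are purely illustrative). Note that `.insert` and `.sort` change the list itself rather than returning a new one.

```python
scores = [3, 1, 2, 3]

print(scores.count(3))   # 2: the value 3 appears twice
print(scores.index(3))   # 0: the first 3 sits at index 0

scores.insert(1, 10)     # insert the value 10 at index 1
print(scores)            # [3, 10, 1, 2, 3]

scores.sort()            # sorts in place, in increasing order
print(scores)            # [1, 2, 3, 3, 10]
```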

# 3. Conditions, if...then...else, and loops

It can be very useful to be able to check whether a condition is verified or loop through a list. We give more examples now.

## 1. Conditions

The first condition is simply whether something is equal to something else. We use here `==`: when the left hand side of the double equality is equal to the right hand side, True is returned. Otherwise False is returned. Note that `==` and `=` mean fundamentally different things: `==` is about checking whether two things are the same, `=` assigns a value to whatever is on the left hand side. Note that not equal to is `!=`.

Create a variable `x` equal to 2. Check using `==` whether `x` is equal to 2.

There are also the `and` and `or` operators. The `and` operator returns True if both conditions it links are true. The `or` operator returns True if at least one of the conditions is true.

Try it out for yourself: define a variable `x` equal to `10`. Try both the `and` and `or` operators with conditions x is equal to 10 and x is greater than or equal to 15.

The final condition we see here is the condition `in` which checks whether a given element is in a given structure.
Define the list `L=[3,4,5]` and the variable `x=3` then check whether x is in L. What happens if `x=1`?
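The conditions discussed in this section can be sketched as follows, using the values suggested in the exercises. Each expression evaluates to either `True` or `False`.

```python
x = 10

print(x == 10)               # True
print(x != 10)               # False

# and: both conditions must hold; or: at least one must hold
print(x == 10 and x >= 15)   # False
print(x == 10 or x >= 15)    # True

# in: membership test
L = [3, 4, 5]
print(3 in L)                # True
print(1 in L)                # False
```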

## 2. If...then...else

We have seen how to check different conditions on variables. It is often useful to take some action based on the outcome of a condition. This is codified in if...then...else routines: if [condition] is satisfied, then [do this], else [do that]. (Python has no `then` keyword: the colon plays that role.) For example:

In [None]:
name = "John"
age = 23
if name == "John" or age == 23:
    print("Your name is John or you are 23 years old.")

else:
    print("Your name is not John and you are not 23 years old.")

Two things are **crucial** here: the colon (`:`) at the end of the `if` and `else` lines, and the indentation. If you don't indent the `print` lines, this code will not work. Try it out!

Sometimes you may have many conditions to check, in which case you would use elif, in this way:

In [1]:
name = "John"

if name == "Rachel":
    print("Your name is Rachel.")

elif name=="John":
    print("Your name is John.")

else:
    print("Your name mustn't be John, nor Rachel.")

Your name is John.


Your turn: create a list with a number of elements between 3 and 5. Then have an if...then...else type statement, which returns 3 if the list of elements has length 3, 4 if it has length 4, 5 if it has length 5, "there is an error" otherwise.

## 3. For loops

For loops enable us to repeat an operation many times without typing it out at each iteration.

For example, we can use it to sum over all possible items in a list:

In [None]:
L=[2,5,10,4]
sum_L=0

for i in L:
    sum_L=sum_L+i
    
print(sum_L)

The notion of `range` is very useful for for loops. Try the following code to understand what range does.

In [4]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


In [5]:
for i in range(5,10):
    print(i)

5
6
7
8
9


In [6]:
for i in range(0,10,2):
    print(i)

0
2
4
6
8


How could we use `range`, `len(L)` to compute the sum of all elements in the list L above?
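One possible answer is sketched below: instead of looping over the items directly, we loop over the positions `0, 1, ..., len(L)-1` and look up each item by its index. The built-in `sum` is shown only as a cross-check.

```python
L = [2, 5, 10, 4]
sum_L = 0

# range(len(L)) produces the indices 0, 1, 2, 3
for i in range(len(L)):
    sum_L = sum_L + L[i]

print(sum_L)    # 21
print(sum(L))   # 21: Python's built-in sum gives the same result
```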

One can also use a `while` loop. For example:

In [7]:
count = 0
while count < 5:
    print(count)
    count = count+1 

0
1
2
3
4


This means that while the condition `count<5` is satisfied, we keep executing the command. As soon as it stops being satisfied, we stop.

These for loops can be very useful to create lists. For example, if we want to create a list with all powers of 2 from 1 to 16, we can do so like this:

In [8]:
List_powersof2=[]
for i in range(0,5):
    List_powersof2.append(2**i)

print(List_powersof2)

[1, 2, 4, 8, 16]


Or even more simply, using what is known as a list comprehension:

In [9]:
List_powersof2=[2**i for i in range(0,5)]
print(List_powersof2)

[1, 2, 4, 8, 16]


We can even add conditions, e.g., if we want to remove 8 from the list.

In [10]:
List_powersof2=[2**i for i in range(0,5) if 2**i!=8]
print(List_powersof2)

[1, 2, 4, 16]
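To make the link between the loop and the one-line form explicit, here is a small sketch that builds the same filtered list both ways. The names `squares_loop` and `squares_comprehension` are illustrative and not part of the course material.

```python
# Squares of the even numbers between 1 and 5, written as an explicit loop
squares_loop = []
for n in range(1, 6):
    if n % 2 == 0:
        squares_loop.append(n ** 2)

# The same list written as a list comprehension with a condition
squares_comprehension = [n ** 2 for n in range(1, 6) if n % 2 == 0]

print(squares_loop)            # [4, 16]
print(squares_comprehension)   # [4, 16]
```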


Your turn! Create a list with all the powers of 10 up to 100000 (included), removing 100.

# 4. Functions

Functions are essentially like recipes for our computer: we only need to define an input, and we get back an output according to the rules of the recipe.

Take, for example, the conversion from Celsius to Fahrenheit. Of course, we can look up and type the formula every time that we need to convert a temperature. But why not make our lives easier?

Remember that $x^\circ C \cdot 9/5+32= y^\circ F$. How much are $15^\circ C$, $25^\circ C$, and $30^\circ C$ in Fahrenheit? Let's start by defining a function that converts **any** temperature in Celsius to the correct temperature in Fahrenheit.

In [None]:
def celsius_to_fahrenheit(celsius):
    fahrenheit = celsius * 9/5 + 32
    return fahrenheit

That was easy! We just had to define the input, the recipe itself, and the output. Let's use it now!

In [None]:
celsius_to_fahrenheit(15)

So far, so good. But what if we want to have all the temperatures converted? The key is that we don't have to redefine the formula, we just call it up. Try combining the function with a for-loop.
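One way to combine the function with a for-loop is sketched below. The function is redefined so that the cell runs on its own; the list names are arbitrary choices for this illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


temperatures_celsius = [15, 25, 30]
temperatures_fahrenheit = []

# Call the function once per temperature instead of retyping the formula
for temperature in temperatures_celsius:
    temperatures_fahrenheit.append(celsius_to_fahrenheit(temperature))

print(temperatures_fahrenheit)   # [59.0, 77.0, 86.0]
```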

# 5. Data structures

Data structures play a key role for us: we start with numpy arrays, then pandas dataframes (which we will use heavily in the Machine Learning part of the course), and then dictionaries (which we use in Optimization).

## 1. Numpy arrays

Numpy arrays are an alternative to Python lists. As indicated by the name, they are part of the Numpy package. The advantages are that they are fast, easy to work with, and give users the opportunity to perform calculations across entire arrays. Note that elements of the array are accessed in exactly the same way as for lists.

Generally, to create a numpy array, we first write a list and then specify that it is a numpy array. This of course requires the numpy package to be imported. For example:

In [None]:
# Create 2 lists height and weight
height = [1.67,  1.87, 1.82, 1.60, 1.73, 1.85]
weight = [55, 100, 83, 91, 61, 70]

# Create 2 numpy arrays from height and weight
np_height = np.array(height)
np_weight = np.array(weight)

Check what the type of `np_height` is.

Numpy arrays are useful for many things. First off, it is very easy to do computations with them. For example, if you want to create an array `bmi` that contains the BMI (given by $weight/height^2$) then this can easily be done by acting as if the arrays are just numbers. Try it out!

Another very useful feature is the ability to filter the array as needed. For example, healthy BMIs are between 18 and 25. We can select the matching values in the `bmi` array with a simple condition. Note that for arrays, `and` becomes `&` and `or` becomes `|`. Do not forget the parentheses!

In [None]:
(bmi>=18) & (bmi<=25)

Do the same for all the BMI values that are *not* healthy (i.e., the values that are below 18 or above 25).

This tells us which indexes satisfy (or don't) this condition. We can also get the actual values if we wish by doing the following:

In [None]:
bmi[(bmi>=18) & (bmi<=25)]

These give us the values of the bmi that are healthy.
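Putting the pieces together, the whole BMI example can be sketched in one self-contained cell. It reuses the height and weight values from above; the names `healthy` and `unhealthy` are introduced here for illustration.

```python
import numpy as np

np_height = np.array([1.67, 1.87, 1.82, 1.60, 1.73, 1.85])
np_weight = np.array([55, 100, 83, 91, 61, 70])

# Arithmetic is applied element by element across the arrays
bmi = np_weight / np_height ** 2

# Boolean masks: one True/False entry per element
healthy = (bmi >= 18) & (bmi <= 25)
unhealthy = (bmi < 18) | (bmi > 25)

print(healthy)         # which positions are healthy
print(bmi[healthy])    # the healthy BMI values themselves
print(healthy.sum())   # True counts as 1, so this is the number of healthy BMIs
```

Summing a boolean mask is a convenient way to count how many elements satisfy a condition.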

Your turn: the list below corresponds to grades obtained in an MBA course. The students that obtained between 80 and 90 will get a B. Create a numpy array. Find which grades satisfy this condition. How many are there?

In [None]:
Grades=[12,54,82,100,95,62,87,34,29,98,72,36,85,69,81,96]



You can easily construct arrays of ones and zeros by using the commands `np.zeros((1,5))` (which would give you an array of size $1\times 5$ of zeros) and likewise `np.ones((1,5))` (which would give you an array of size $1\times 5$ of ones).

In [None]:
print(np.zeros((1,5)))
print(np.ones((1,5)))

You can also construct arrays that contain a range by using `np.arange()`.

In [None]:
np.arange(20)

You can then reshape this array as needed: say you want an array that has 4 rows and 5 columns, you could simply do the following:

In [None]:
np.arange(20).reshape((4,5))

To obtain the shape of the array above, we use `shape`:

In [None]:
np.arange(20).reshape((4,5)).shape

Your turn! Create a numpy array `A` using `np.arange(12)`. Reshape it so that it is of size (4,3). Then create another numpy array `B` using `np.ones(12)` and reshape it so that it is of size (4,3). Try `A+B`, `A*B`, `A.min()`, `A.min(axis=0)`, `A.min(axis=1)`. What happens each time?

Other functions such as `.max`, `.cumsum`, and `.sum` work in a similar fashion.

Two important functions that we discuss now are `.all` and `.any`. They are also useful for dataframes. Do you understand what they're doing?
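The role of the `axis` argument can be sketched on a smaller array than the one in the exercise (a $2\times 3$ array is assumed here just to keep the output short). With `axis=0` the operation runs down each column; with `axis=1` it runs along each row.

```python
import numpy as np

A = np.arange(6).reshape((2, 3))
# A is:
# [[0 1 2]
#  [3 4 5]]

print(A.min())          # 0: minimum over the whole array
print(A.min(axis=0))    # [0 1 2]: minimum of each column
print(A.min(axis=1))    # [0 3]: minimum of each row
print(A.sum(axis=0))    # [3 5 7]: sum of each column
print(A.cumsum())       # [ 0  1  3  6 10 15]: running total over the flattened array
```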

In [None]:
(B==1).all()

In [None]:
(A==1).all()

In [None]:
(A==1).any()
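As a hedged sketch of what is going on, using small made-up arrays rather than the `A` and `B` from the exercise: a comparison such as `A == 1` produces an array of booleans, `.all()` asks whether every entry is `True`, and `.any()` asks whether at least one is.

```python
import numpy as np

ones = np.ones(4)        # [1. 1. 1. 1.]
counts = np.arange(4)    # [0 1 2 3]

print(counts == 1)            # [False  True False False]
print((ones == 1).all())      # True: every entry equals 1
print((counts == 1).all())    # False: not every entry equals 1
print((counts == 1).any())    # True: at least one entry equals 1
```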

## 2. Pandas Dataframes

This is an incredibly important type of data structure and one we will use almost exclusively in the machine learning lectures. We will most often read a dataframe from a .csv file (similar to those you can open in Excel). This will require us to use e.g., `Dataset=pd.read_csv("data.csv")`. Here, we don't do this but use a pre-existing dataset from the seaborn plotting package as an example.

In [172]:
import seaborn as sns
tips=sns.load_dataset("tips")



The reason we use dataframes a lot is because they are formatted in a functional way (and so easy to read). What happens if you just type in `tips`?

In [173]:
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


Take a look at the header of tips using `.head()`.

We can also create our own dataframe using a dictionary: we will briefly discuss this when we discuss dictionaries. We will rarely need to do this in our examples.

Dataframes do not treat features (i.e., the columns) and observations (i.e., the rows) in the same way. For example to access column data (e.g., total_bill), it is quite easy. Accessing observations is a bit harder.

Try `tips['total_bill']` and `tips[['total_bill']]`. What is the difference? It may be useful to use `type` to answer the question. We will use `tips[['total_bill']]` in general, as this is a data type we are comfortable with.
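The difference can be sketched on a tiny hand-made dataframe, so that the cell runs without loading the seaborn dataset. The name `mini_tips` is hypothetical; its values are copied from the first rows of `tips` shown above.

```python
import pandas as pd

# Hypothetical mini version of the tips data
mini_tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "smoker": ["No", "No", "No"],
})

single_brackets = mini_tips["total_bill"]      # one column as a Series
double_brackets = mini_tips[["total_bill"]]    # a dataframe with one column

print(type(single_brackets))
print(type(double_brackets))

# A list of names inside the brackets selects several columns at once
two_columns = mini_tips[["total_bill", "smoker"]]
print(two_columns.shape)    # (3, 2)
```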

Try selecting two columns now, say `total_bill` and `smoker`.

To access observations, we use square brackets but with integers. For example `tips[0:5]` accesses the first 5 observations. How would you access observations 5 through 10?

In [None]:
tips[0:5]

Sometimes it can be complicated to understand what the *index* of our dataframe is. We use `.index` for this. What do we get in this case? Do you find this surprising?

Accessing one observation rather than a range (as done above) can be done using `.loc`: for example, `tips.loc[[5]]`. This returns the row whose index label is 5. If you want the row at position 5, you would use `tips.iloc[[5]]`. The commands `.loc` and `.iloc` can also be used for the columns. See the example below.

In [None]:
tips.loc[[5]]

In [None]:
tips.iloc[:,1:3]

Using `.loc` or `.iloc`, you can also select many different observations that are not necessarily contiguous, for example, try selecting rows indexed by 5 and 8 using a similar strategy to arrays.
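In `tips`, index labels and positions happen to coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a made-up dataframe `df` with index labels 10, 20, 30, 40 (an assumption for illustration) so that the two behave visibly differently.

```python
import pandas as pd

df = pd.DataFrame(
    {"total_bill": [16.99, 10.34, 21.01, 23.68],
     "tip": [1.01, 1.66, 3.50, 3.31]},
    index=[10, 20, 30, 40],
)

print(df.loc[[20]])        # the row whose index LABEL is 20
print(df.iloc[[1]])        # the row at POSITION 1 (the same row here)

# Several non-contiguous rows: pass a list of labels (or positions)
print(df.loc[[10, 40]])
print(df.iloc[[0, 3]])

# Columns by position: all rows, first column only
print(df.iloc[:, 0:1])
```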

You can also filter in a similar way as done for arrays: for example, if you only want to consider male tippers you would use:

In [174]:
tips[tips["sex"]=="Male"]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.00,Male,No,Sun,Dinner,2
...,...,...,...,...,...,...,...
236,12.60,1.00,Male,Yes,Sat,Dinner,2
237,32.83,1.17,Male,Yes,Sat,Dinner,2
239,29.03,5.92,Male,No,Sat,Dinner,3
241,22.67,2.00,Male,Yes,Sat,Dinner,2


Your turn! Select the rows that correspond to the size of the party being greater than or equal to 4.

Note: when `inplace` is True, the operation is performed on the original data; when it is False, it is performed on a copy of the original data.

Now, try using `tips.where(tips["sex"]=="Male")`. What is the difference with the method used above?

In [176]:
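# .where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)
# (signature as noted from an older pandas release; newer releases may not accept all of these parameters)
# Where cond is True, keep the original value; otherwise replace it with other. inplace=True operates on the original data, inplace=False on a copy.
# other must be the same shape as self.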

tips["sex"]=="Male"

0      False
1       True
2       True
3       True
4      False
       ...  
239     True
240    False
241     True
242     True
243    False
Name: sex, Length: 244, dtype: bool

In [175]:
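# Where cond is True, keep the original value; otherwise replace it with other.
# From the result above, index 0 is False, so that row is replaced with NaN;
# index 1 is True, so that row keeps its original values.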
tips.where(tips["sex"]=="Male")

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,,,,,,,
1,10.34,1.66,Male,No,Sun,Dinner,3.0
2,21.01,3.50,Male,No,Sun,Dinner,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2.0
4,,,,,,,
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3.0
240,,,,,,,
241,22.67,2.00,Male,Yes,Sat,Dinner,2.0
242,17.82,1.75,Male,No,Sat,Dinner,2.0
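
To make the contrast between boolean filtering and `.where` concrete, here is a minimal sketch on a made-up three-row dataframe (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical miniature data; not the real tips dataset.
toy = pd.DataFrame({"sex": ["Female", "Male", "Male"], "size": [2, 3, 3]})
mask = toy["sex"] == "Male"

filtered = toy[mask]      # drops the rows where the mask is False
masked = toy.where(mask)  # keeps every row, but fills non-matching rows with NaN

print(filtered.shape)
print(masked.shape)
print(masked)
```

Filtering changes the number of rows, whereas `.where` preserves the shape. Because NaN is a floating-point value, integer columns such as `size` are shown as floats after `.where`, which matches the `3.0` values in the output above.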


It is also important to know how to drop a column from a dataframe. This is done using `.drop(columns=["name1","name2"])`. Drop the column "time" from the tips dataset.
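
As a small sketch of how `.drop` behaves (on a made-up dataframe named `toy`, which is an assumption for illustration), note that it returns a new dataframe and leaves the original untouched unless you reassign it:

```python
import pandas as pd

# Hypothetical miniature data with the same column names as part of tips.
toy = pd.DataFrame({
    "tip": [1.01, 1.66],
    "day": ["Sun", "Sun"],
    "time": ["Dinner", "Dinner"],
})

without_time = toy.drop(columns=["time"])  # new dataframe without the column

print(list(without_time.columns))
print(list(toy.columns))  # the original still has "time"
```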

In [None]:
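# numeric_only=True skips text columns such as sex and day, which recent pandas versions refuse to average
tips.mean(numeric_only=True)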

There are many, many different functions that can be applied to the dataset: `.describe()` is probably the most useful. We also have things such as `.max`, `.min`, `.median`, etc. Try them out!

Finally, we review three functions, `.join`, `.merge`, and `.groupby`, all of which are very useful.

The functions `.join` and `.merge` are quite similar. We first look at `.join`. Run the following example: what is the code doing?

In [None]:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value1': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value2': [5, 6, 7, 8]})
df1.join(df2)
#columns should have different names e.g. value1, value2 for join to work: they are joined based on common index.
#note that the dataframe obtained has the same size as the two initial datasets

In [None]:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
df1.merge(df2, left_on='lkey', right_on='rkey')
#here, rows are matched on the values in lkey and rkey: as foo appears twice in each dataframe, it produces 2 x 2 = 4 rows.
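
To compare the two side by side, the sketch below reuses the same two dataframes and checks the shape of each result. The `lsuffix` and `rsuffix` arguments are needed here only because both dataframes have a column called `value`:

```python
import pandas as pd

df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"], "value": [5, 6, 7, 8]})

# join lines rows up by index, so the result keeps 4 rows
joined = df1.join(df2, lsuffix="_left", rsuffix="_right")

# merge lines rows up by key values, so every foo on the left meets every foo on the right
merged = df1.merge(df2, left_on="lkey", right_on="rkey")

print(joined.shape)
print(merged.shape)
print((merged["lkey"] == "foo").sum())
```

The merged result has 6 rows: 4 for `foo`, 1 for `bar`, and 1 for `baz`.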

Don't get too caught up in this! Just know that they exist, and refer back here if you ever need to use them.

We now move on to `.groupby`. This enables us to simplify datasets considerably by grouping together observations that share some characteristic and then applying an operation to each group.

In [None]:
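# numeric_only=True restricts the mean to numeric columns, as recent pandas versions require
tips.groupby(["sex"]).mean(numeric_only=True)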

For example, we observe here that the total bill is generally higher when a man pays than when a woman does. How do the days of the week impact the total bill?

Use `.groupby` and `.count` to find the most observed set-up of size of group and sex in the dataset.
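
The pattern of grouping on one or several columns and then reducing can be sketched on a made-up dataframe (`toy` and its values are assumptions for illustration, so the numbers say nothing about the real tips data):

```python
import pandas as pd

# Hypothetical miniature data.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "size": [2, 2, 2, 3, 2, 2],
    "total_bill": [10.0, 20.0, 15.0, 30.0, 25.0, 12.0],
})

# One group per value of sex, then the mean bill within each group.
mean_bill = toy.groupby("sex")["total_bill"].mean()

# One group per (sex, size) pair, then the number of observations in each.
counts = toy.groupby(["sex", "size"])["total_bill"].count()
most_common = counts.idxmax()  # the (sex, size) pair with the highest count

print(mean_bill)
print(counts)
print(most_common)
```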

## 3. Dictionaries

Dictionaries are a built-in data storage type in Python. They have the advantage of being very flexible: each entry stores only the information we actually have. They can be hard to read, however. Consider the following dictionary.

In [13]:
scientists={"marie curie":["radioactivity",2,3,4], "albert einstein":["relativity",1], "isaac newton":["gravity"]}

In [17]:
scientists['marie curie'][0::3]

['radioactivity', 4]

In [25]:
scientists.get('marie curie')

['radioactivity', 2, 3, 4]

The numbers here are the number of Nobel Prizes (only a handful of people have ever received two). Note that as Isaac Newton lived before Nobel Prizes existed, he does not have any. This isn't a problem in dictionaries: we can store as much (or as little) information in each entry as we like (i.e., there can be a lot of variability between entries). In a dataframe, for example, this would have to be an empty entry, so we would waste room storing a useless entry.

We will see that this is very useful for optimization. Try calling the "marie curie" entry in the dictionary. How would you proceed? Note that "marie curie", "albert einstein", "isaac newton" are what are called *keys*. We can list them by typing `scientists.keys()`.

In [22]:
scientists.keys()

dict_keys(['marie curie', 'albert einstein', 'isaac newton'])

In [26]:
scientists.values()

dict_values([['radioactivity', 2, 3, 4], ['relativity', 1], ['gravity']])
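
The main ways of reading from a dictionary can be summarized in a short sketch. The dictionary below is a simplified variant of the one above, and the key "ada lovelace" is a hypothetical missing key used only to show the default argument of `.get`:

```python
scientists = {
    "marie curie": ["radioactivity", 2],
    "albert einstein": ["relativity", 1],
    "isaac newton": ["gravity"],
}

curie = scientists["marie curie"]             # square brackets raise KeyError if the key is absent
missing = scientists.get("ada lovelace", [])  # .get returns the default instead of raising

# .items() yields (key, value) pairs, which is handy in loops
for name, facts in scientists.items():
    print(name, len(facts))
```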

We can also use dictionaries to construct dataframes. We first construct a dictionary: `data={'col1':[information], 'col2':[information]}` (we avoid calling it `dict`, which would hide Python's built-in `dict` type). We then use `pd.DataFrame(data)` to obtain the corresponding dataframe.

Construct a dictionary with columns "capital" and "continent" included. Populate these with the information on your favorite countries. Make this into a dataframe with the index being the countries' names.
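
One possible shape for this, sketched with three example countries (the choice of countries and the variable names `data` and `countries` are just for illustration):

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values.
data = {
    "capital": ["Paris", "Tokyo", "Nairobi"],
    "continent": ["Europe", "Asia", "Africa"],
}

# The index argument supplies the row labels, here the country names.
countries = pd.DataFrame(data, index=["France", "Japan", "Kenya"])

print(countries)
print(countries.loc["Japan", "capital"])
```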

In [29]:
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}

x = car.keys()

print(x) #before the change

# Add a new item to the original dictionary, and see that the keys list gets updated as well
car["color"] = "white"

print(x) #after the change
car

dict_keys(['brand', 'model', 'year'])
dict_keys(['brand', 'model', 'year', 'color'])


{'brand': 'Ford', 'model': 'Mustang', 'year': 1964, 'color': 'white'}

# 6. Exercises *(Time-permitting or Homework)*

## Exercise 1: 

Consider this list `L= [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]`. Write a routine that takes this list and makes a new list that only has even elements of this list in it. (This can be done in one line.)

In [32]:
L= [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
l = [i for i in L if i%2 == 0]
l

[4, 16, 36, 64, 100]

## Exercise 2:

1. Consider these two lists: `L1 = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]` and `L2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]`. Write a routine that returns a list that contains only the elements that are in common between the lists (possibly with duplicates). This can be done in one line of code. 
2. Write a routine that removes the duplicates from the list obtained (i.e., instead of `[1, 1, 2, 3, 5, 8, 13]` obtain `[1, 2, 3, 5, 8, 13]`). This can be done in 2 lines of code. *(Hard)*

In [66]:
L1 = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
L2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

zipped = zip(L1, L2)  
list(zipped)

[(1, 1),
 (1, 2),
 (2, 3),
 (3, 4),
 (5, 5),
 (8, 6),
 (13, 7),
 (21, 8),
 (34, 9),
 (55, 10),
 (89, 11)]

In [56]:
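# zip pairs elements by position and returns tuples, so this keeps a value only when both lists hold it at the same index;
# it therefore does not find all the common elements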
[i for i, j in zip(L1, L2) if i == j]

[1, 5]

In [93]:
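# find the elements that are repeated within a list
from collections import Counter
# Counter maps each value to its count; .items() then gives a list of (value, count) tuples
counts = dict(Counter(L1))
print(dict(Counter(L1)))
print(list(counts.items()))

# each key is a value from the list and each value is its count; a count above 1 means the value is duplicated
duplicated_v = [key for key, value in counts.items() if value > 1]
# map each duplicated value to its count
matched_v = {key:value for key, value in counts.items() if value > 1}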

print(duplicated_v)
print(matched_v)

{1: 2, 2: 1, 3: 1, 5: 1, 8: 1, 13: 1, 21: 1, 34: 1, 55: 1, 89: 1}
[(1, 2), (2, 1), (3, 1), (5, 1), (8, 1), (13, 1), (21, 1), (34, 1), (55, 1), (89, 1)]
[1]
{1: 2}


In [86]:
# alternative methods
new_list = [i for i in L1 if i in L2]
new_list

[1, 1, 2, 3, 5, 8, 13]

In [45]:
# alternative methods
set(L1).intersection(L2)

{1, 2, 3, 5, 8, 13}

In [43]:
# alternative methods
list(set(L1) & set(L2))

[1, 2, 3, 5, 8, 13]

In [89]:
# remove duplicate elements from a list: fromkeys builds a dict whose keys are the list's elements,
# so repeats collapse into one key while the first-seen order is kept
from collections import OrderedDict
new_list2 = OrderedDict.fromkeys(new_list)
new_list2 = list(new_list2)
new_list2

[1, 2, 3, 5, 8, 13]

In [96]:
# pandas' unique function can also be used
import pandas as pd
list(pd.unique(new_list))

[1, 2, 3, 5, 8, 13]

## Exercise 3

1. Write a routine that takes this array `arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])` as input and modifies it so that all odd numbers have been replaced with -1. This should take one line.
2. How would you modify this routine so that you create a *new* array instead of just modifying `arr`? Hint: you may want to use the `np.where()` function which is similar to `.where()` for dataframes. Check out the documentation!

In [128]:
import numpy as np
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(arr)

arr[arr%2 == 1] = -1
print(list(arr))

arr[arr%2 == 1] *= -1
print(list(arr))

[0 1 2 3 4 5 6 7 8 9]
[0, -1, 2, -1, 4, -1, 6, -1, 8, -1]
[0, 1, 2, 1, 4, 1, 6, 1, 8, 1]


In [154]:
# peeling the onion: unwrap the result of np.where one layer at a time
a = np.where(arr % 2 == 1)
print(a)

a = np.array(a).tolist()
print(a)

a = a[0]
print(a)

print(a[1:4])


(array([1, 3, 5, 7, 9]),)
[[1, 3, 5, 7, 9]]
[1, 3, 5, 7, 9]
[3, 5, 7]
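
For the second part of the exercise, one possible approach is the three-argument form of `np.where`, which builds a new array rather than modifying the input. This is a sketch of one way to do it, not the only one:

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# np.where(condition, value_if_true, value_if_false) returns a new array
new_arr = np.where(arr % 2 == 1, -1, arr)

print(new_arr)
print(arr)  # the original array is unchanged
```

With a single argument, as in the cell above, `np.where` instead returns the indices at which the condition holds.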


## Exercise 4

1. Given two arrays: `arr1=np.array([1,3,5])` and `arr2=np.array([2,4,5])`, stack them one on top of another to obtain `arr`. Use `.vstack` for this (see documentation).
2. How can you access entry 6 in `arr`?

In [156]:
arr1=np.array([1,3,5])
arr2=np.array([2,4,5])

In [169]:
# like the concat function 
# stack arrays in sequence vertically (row wise)
a = np.vstack((arr1, arr2))
a

array([[1, 3, 5],
       [2, 4, 5]])

In [171]:
# call the entry 6 in arr
a[1][2]

5
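
Assuming "entry 6" means the sixth element when reading the array row by row, there are a few equivalent ways to reach it. The sketch below compares them:

```python
import numpy as np

arr = np.vstack((np.array([1, 3, 5]), np.array([2, 4, 5])))

by_row_col = arr[1, 2]    # second row, third column (indices start at 0)
by_flat = arr.flat[5]     # sixth element in row-major order
by_ravel = arr.ravel()[5] # same idea, via a flattened view

print(arr.shape)
print(by_row_col, by_flat, by_ravel)
```

The form `arr[1, 2]` is generally preferred over `arr[1][2]` because it indexes in one step instead of first extracting a whole row.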

## Exercise 5

1. Using the `tips` dataframe from above, find the largest total tip left.
2. Isolate the row/observation which this corresponds to. Hint: This is one of the rare times where we need to use `tips["total_bill"]` rather than `tips[["total_bill"]]`.
3. For each day of the week, find the average total bill left.

In [22]:
import seaborn as sns
import numpy as np 
import pandas as pd

tips=sns.load_dataset("tips")
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [194]:
# find the largest total tip left
max(tips['tip'])

10.0

In [188]:
print(tips["total_bill"])
print(type(tips["total_bill"]))

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64
<class 'pandas.core.series.Series'>


In [190]:
print(type(tips[["total_bill"]]))
tips[["total_bill"]]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,total_bill
0,16.99
1,10.34
2,21.01
3,23.68
4,24.59
...,...
239,29.03
240,27.18
241,22.67
242,17.82


In [202]:
# Isolate the row/observation that corresponds to the largest tip paid
# (extract the row with the largest tip)
cor_row = tips[(tips.tip == max(tips['tip']))]
cor_row

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
170,50.81,10.0,Male,Yes,Sat,Dinner,3


In [203]:
cor_row[['sex', 'tip']]

Unnamed: 0,sex,tip
170,Male,10.0


In [205]:
type(cor_row['tip'])

pandas.core.series.Series
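
An alternative to comparing against the maximum is `.idxmax()`, which returns the index label of the largest value in a Series. Here is a minimal sketch on made-up data (`toy` and its index labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature data with non-consecutive index labels.
toy = pd.DataFrame(
    {"total_bill": [16.99, 50.81, 21.01], "tip": [1.01, 10.0, 3.5]},
    index=[0, 170, 2],
)

label = toy["tip"].idxmax()  # index label of the largest tip
row = toy.loc[[label]]       # double brackets keep the result a DataFrame

print(label)
print(row)
```

Note that `.idxmax()` needs a Series, which is why the single-bracket form is used here. If several rows tie for the maximum, it returns only the first, whereas the boolean comparison returns all of them.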

In [41]:
# For each day of the week, find the average total bill left
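# passing the string 'mean' gives the same result as np.mean and avoids a deprecation warning in recent pandas
s = tips.groupby('day')['total_bill'].aggregate('mean')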
df = pd.DataFrame(s)
df.loc[['Thur']]


day,total_bill
Thur,17.682742


In [40]:
df

day,total_bill
Thur,17.682742
Fri,17.151579
Sat,20.441379
Sun,21.41


In [38]:
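# transform returns one value per original row (the mean of that row's group), not one value per group
s = tips.groupby('day')['total_bill'].transform('mean')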
df = pd.DataFrame(s)
df

Unnamed: 0,total_bill
0,21.410000
1,21.410000
2,21.410000
3,21.410000
4,21.410000
...,...
239,20.441379
240,20.441379
241,20.441379
242,20.441379


In [23]:
tips2 = tips.copy()
tips2.loc[:, 'ave_bill'] = df
tips2

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,ave_bill
0,16.99,1.01,Female,No,Sun,Dinner,2,21.410000
1,10.34,1.66,Male,No,Sun,Dinner,3,21.410000
2,21.01,3.50,Male,No,Sun,Dinner,3,21.410000
3,23.68,3.31,Male,No,Sun,Dinner,2,21.410000
4,24.59,3.61,Female,No,Sun,Dinner,4,21.410000
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.441379
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.441379
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.441379
242,17.82,1.75,Male,No,Sat,Dinner,2,20.441379
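
The difference between aggregating and transforming after a `groupby` can be seen on a made-up five-row dataframe (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical miniature data.
toy = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
})

per_day = toy.groupby("day")["total_bill"].agg("mean")        # one value per group
per_row = toy.groupby("day")["total_bill"].transform("mean")  # one value per original row

toy["ave_bill"] = per_row  # same length and index as toy, so it fits as a new column

print(per_day)
print(toy)
```

Because the transformed result is aligned with the original rows, it can be attached directly as a column, which is what the `ave_bill` column above does.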


In [49]:
df = tips2.pivot(index='tip', columns='total_bill', values='size')
df

ValueError: Index contains duplicate entries, cannot reshape

In [50]:
df2 = pd.DataFrame({

'id_user':[     1,      2,      3,      4,       4,       5,      5], 
'information':['phone','phon','phone','phone1','phone','phone1','phone'], 
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})

df2.pivot(index='id_user', columns='information', values='value')

id_user,phon,phone,phone1
1,,1,
2,01.01.00,,
3,,01.02.00,
4,,01.03.00,2.0
5,,01.04.00,3.0


In [46]:
df2 = pd.DataFrame({

'id_user':[        1,       2,      3,      4,      4,      5,      5], 
'information':['phone','phone','phone','phone','phone','phone','phone'], 
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})

df2.pivot(index='id_user', columns='information', values='value')

ValueError: Index contains duplicate entries, cannot reshape
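
The error arises because `.pivot` needs each (index, columns) pair to appear at most once, and here users 4 and 5 each have two "phone" rows. One possible workaround is `.pivot_table`, which combines duplicates with an aggregation function. The sketch below swaps in numeric values (an assumption, so that they can be summed):

```python
import pandas as pd

df2 = pd.DataFrame({
    "id_user": [1, 2, 3, 4, 4, 5, 5],
    "information": ["phone"] * 7,
    "value": [1, 2, 3, 4, 5, 6, 7],  # numeric stand-ins, not the original values
})

# pivot refuses duplicates of the same (id_user, information) pair
try:
    df2.pivot(index="id_user", columns="information", values="value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table resolves duplicates by aggregating them, here with a sum
wide = df2.pivot_table(index="id_user", columns="information", values="value", aggfunc="sum")
print(wide)
```

Which aggregation is appropriate (sum, mean, first, and so on) depends on what the duplicate rows mean in your data.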