<a href="https://colab.research.google.com/github/NicoPatalagua/Pandas/blob/master/PandasExercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pandas Exercises**
## *Patalagua Suárez Nicolás*
### Universidad Sergio Arboleda

### ***What is Pandas?***

In computing and data science, pandas is a software library for the Python programming language, built on top of NumPy, for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software distributed under the three-clause BSD license.

 https://pandas.pydata.org/

### ***What is NumPy?***

NumPy is a Python library that adds support for multidimensional arrays (vectors and matrices), along with a collection of high-level mathematical functions that operate on them. NumPy's ancestor, Numeric, was originally created by Jim Hugunin with contributions from other developers. In 2005 Travis Oliphant created NumPy by incorporating features of Numarray into Numeric, with some modifications.


 https://numpy.org/

### ***Repository***
This work is based on the repository https://github.com/guipsamora/pandas_exercises, developed by Guilherme Samora, a Senior Product Manager at the Global Savings Group in Munich, Germany.

## ***Getting and knowing***

### ***Chipotle***

Dataset and materials: https://github.com/justmarkham 

**Step 1.** *Import the necessary libraries*

In [0]:
#Import the pandas library and assign it to the variable pd
import pandas as pd
#Import the numpy library and assign it to the variable np
import numpy as np

**Step 2.** *Import the dataset from this address.*


In [0]:
#Store the URL of the file to be used in the variable dataset
dataset = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

**Step 3.** *Assign it to a variable called chipo.*

In [0]:
#Create the variable chipo
chipo = pd.read_csv(dataset, sep = '\t')
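The same call works on any tab-separated text. As a minimal sketch that runs without network access, the cell below reads a small in-memory sample; the `sample_tsv` string is made up for illustration and is not the Chipotle file.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample (not the real dataset)
sample_tsv = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# sep='\t' tells read_csv that the fields are separated by tabs
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample.shape)
```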

**Step 4.** *See the first 10 entries.*

In [6]:
#head(n) shows the first n rows of the DataFrame
chipo.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


**Step 5.** *What is the number of observations in the dataset?*

In [8]:
#shape[0] is the number of rows (observations)
chipo.shape[0]

4622

**Step 6.** *What is the number of columns in the dataset?*

In [9]:
#shape[1] is the number of columns
chipo.shape[1]

5

**Step 7.** *Print the name of all the columns.*

In [11]:
#The columns attribute holds the names of the columns
chipo.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

**Step 8.** *How is the dataset indexed?*

In [13]:
#The index attribute shows how the rows are indexed
chipo.index

RangeIndex(start=0, stop=4622, step=1)
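Steps 5 to 8 all read attributes of the DataFrame. A small sketch on a made-up frame (`toy` and its values are hypothetical) shows the same attributes side by side:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"order_id": [1, 1, 2], "quantity": [1, 2, 1]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
# columns is an Index object; tolist() turns it into a plain list
column_names = toy.columns.tolist()

print(n_rows, n_cols, column_names, toy.index)
```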

**Step 9.** *Which was the most-ordered item?*

In [14]:
#Group by item name, sum each group and sort the totals by quantity in descending order
ObjItem= chipo.groupby('item_name').sum().sort_values(['quantity'], ascending=False)
#The index label of the first row is the most-ordered item
ObjItem.index[0]

'Chicken Bowl'

**Step 10.** *For the most-ordered item, how many items were ordered?*

In [15]:
#Print Quantity of ordered items
ObjItem.iloc[0,1] 

761
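An alternative that avoids positional indexing is to sum only the `quantity` column and ask for the label of the largest total. This is a sketch on made-up data (`toy_orders` is hypothetical), not the notebook's original approach:

```python
import pandas as pd

# Hypothetical toy orders
toy_orders = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Side of Chips"],
    "quantity": [2, 1, 3, 1],
})

# Total quantity per item
totals = toy_orders.groupby("item_name")["quantity"].sum()

# idxmax returns the label of the largest total, max returns the total itself
top_item = totals.idxmax()
top_quantity = totals.max()
print(top_item, top_quantity)
```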

**Step 11.** *What was the most ordered item in the choice_description column?*

In [0]:
#Group by choice description, sum each group and sort the totals by quantity in descending order
ObjItem= chipo.groupby('choice_description').sum().sort_values(['quantity'], ascending=False)
#The index label of the first row is the most-ordered choice
ObjItem.index[0]

'[Diet Coke]'

**Step 12.** *How many items were ordered in total?*

In [0]:
#Count the quantity of items ordered with the quantity column
chipo.quantity.sum()

4972

**Step 13.** *Turn the item price into a float.*

> **a.** *Check the item price type.*



In [0]:
#The dtype attribute returns the data type of the selected column
chipo.item_price.dtype

dtype('O')



> **b.** *Create a lambda function and change the type of item price.*



In [0]:
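#Drop the first character ('$') and the last character of each price, then convert to float
#(the [1:-1] slice assumes every value in this file ends with one extra character, such as a trailing space)
chipo.item_price=chipo.item_price.apply(lambda x: float(x[1:-1]))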


>**c.** *Check the item price type.*



In [0]:
#recheck data type
chipo.item_price.dtype

dtype('float64')
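The same conversion can also be written with vectorized string methods instead of a lambda. The sketch below assumes the raw prices look like `'$2.39 '`, with a leading dollar sign and a trailing space; the sample values are made up.

```python
import pandas as pd

# Hypothetical raw prices; the trailing space is an assumption about the file
prices = pd.Series(["$2.39 ", "$16.98 ", "$1.69 "])

# Remove surrounding whitespace, drop the leading '$', then cast to float
as_float = prices.str.strip().str.lstrip("$").astype(float)
print(as_float.dtype)
```

Unlike a fixed `[1:-1]` slice, this version does not depend on every value ending with exactly one extra character.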

**Step 14.** *How much was the revenue for the period in the dataset?*

In [0]:
#We multiply the quantity by the price and add the result
(chipo['quantity']*chipo['item_price']).sum()

39237.02

**Step 15.** *How many orders were made in the period?*

In [0]:
#value_counts has one entry per distinct order_id, so counting the entries gives the number of orders
chipo.order_id.value_counts().count()

1834
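Counting distinct values is also available directly through `nunique`. A short sketch on made-up order ids shows that both approaches agree:

```python
import pandas as pd

# Hypothetical order ids with repeats
order_ids = pd.Series([1, 1, 2, 3, 3, 3])

# Two ways to count the distinct values
via_value_counts = order_ids.value_counts().count()
via_nunique = order_ids.nunique()
print(via_value_counts, via_nunique)
```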

**Step 16.** *What is the average revenue amount per order?*

In [0]:
#Store the revenue of each row (quantity times price) in a new column
chipo['revenue'] = chipo['quantity']*chipo['item_price']
#Sum the revenue of each order and take the mean across orders
chipo.groupby('order_id')['revenue'].sum().mean()

21.394231188658654
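To make the arithmetic concrete, here is the same calculation on a tiny made-up frame (`toy` and its numbers are hypothetical): order 1 earns 1 × 2.0 + 2 × 3.0 = 8.0, order 2 earns 10.0, so the average per order is 9.0.

```python
import pandas as pd

# Hypothetical toy data: two orders, three rows
toy = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 1],
    "item_price": [2.0, 3.0, 10.0],
})

# Revenue of each row, then total per order, then the mean over orders
toy["revenue"] = toy["quantity"] * toy["item_price"]
per_order = toy.groupby("order_id")["revenue"].sum()
average_per_order = per_order.mean()
print(average_per_order)
```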

**Step 17.** *How many different items are sold?*

In [0]:
#value_counts has one entry per distinct item, so counting the entries gives the number of different items
chipo.item_name.value_counts().count()

50

### ***Occupation***

Dataset and materials: https://github.com/justmarkham

**Step 1.** *Import the necessary libraries.*

In [0]:
#Import the pandas library and assign it to the variable pd
import pandas as pd
#Import the numpy library and assign it to the variable np
import numpy as np

**Step 2.** *Import the dataset from this address.*

In [0]:
#Store the URL of the file to be used in the variable dataset
dataset = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'

**Step 3.** *Assign it to a variable called users and use the 'user_id' as index.*

In [0]:
#Create the variable users and use user_id as index
users = pd.read_csv(dataset, sep='|', index_col='user_id')
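As a sketch of what `sep='|'` and `index_col` do, the cell below parses a two-row in-memory sample. The `sample_psv` string is built for illustration from the first two rows shown in Step 4.

```python
import io
import pandas as pd

# Small pipe-separated sample built for illustration
sample_psv = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
)

# index_col='user_id' moves that column out of the data and into the index
sample_users = pd.read_csv(io.StringIO(sample_psv), sep="|", index_col="user_id")
print(sample_users.index.name, list(sample_users.columns))
```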

**Step 4.** *See the first 25 entries.*

In [22]:
#head(n) shows the first n rows of the DataFrame
users.head(25)

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,5201
9,29,M,student,1002
10,53,M,lawyer,90703


**Step 5.** *See the last 10 entries.*

In [23]:
#tail(n) returns the last n rows, just as head(n) returns the first n
users.tail(10)

user_id,age,gender,occupation,zip_code
934,61,M,engineer,22902
935,42,M,doctor,66221
936,24,M,other,32789
937,48,M,educator,98072
938,38,F,technician,55038
939,26,F,student,33319
940,32,M,administrator,2215
941,20,M,student,97229
942,48,F,librarian,78209
943,22,M,student,77841


**Step 6.** *What is the number of observations in the dataset?*

In [24]:
#shape[0] is the number of rows (observations)
users.shape[0]

943

**Step 7.** *What is the number of columns in the dataset?*

In [25]:
#shape[1] is the number of columns
users.shape[1]

4

**Step 8.** *Print the name of all the columns.*

In [26]:
#The columns attribute holds the names of the columns
users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

**Step 9.** *How is the dataset indexed?*

In [28]:
#The index method returns the indexing of the dataset
users.index

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)

**Step 10.** *What is the data type of each column?*

In [29]:
#The dtypes attribute returns the data type of each column
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Step 11.** *Print only the occupation column.*

In [31]:
#A column can be accessed as an attribute of the DataFrame
users.occupation

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

**Step 12.** *How many different occupations are in this dataset?*

In [33]:
#value_counts lists each distinct occupation once, so counting its entries gives the number of different occupations
users.occupation.value_counts().count()

21

**Step 13.** *What is the most frequent occupation?*

In [36]:
#value_counts sorts by frequency in descending order, so the first row is the most frequent occupation
users.occupation.value_counts().head()

student          196
other            105
educator          95
administrator     79
engineer          67
Name: occupation, dtype: int64
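To get only the most frequent value rather than the top five, `idxmax` can be applied to the counts. A minimal sketch on made-up occupations:

```python
import pandas as pd

# Hypothetical occupations
occupations = pd.Series(["student", "other", "student", "educator"])

# Frequency of each value, most frequent first
counts = occupations.value_counts()
# idxmax returns the label with the highest count
most_frequent = counts.idxmax()
print(most_frequent)
```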

**Step 14.** *Summarize the DataFrame.*

In [37]:
#With the describe method, this information is obtained
users.describe()

Unnamed: 0,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


**Step 15.** *Summarize all the columns.*

In [39]:
#include = "all" extends the summary to every column, not only the numeric ones
users.describe(include = "all")

Unnamed: 0,age,gender,occupation,zip_code
count,943.0,943,943,943.0
unique,,2,21,795.0
top,,M,student,55414.0
freq,,670,196,9.0
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,
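The difference between the two calls is easiest to see on a small frame. In this sketch (`toy` is made up), `describe()` keeps only the numeric column, while `include="all"` adds the text column and the `unique`, `top` and `freq` rows:

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({"age": [24, 53, 23], "gender": ["M", "F", "M"]})

# Default summary: numeric columns only
numeric_summary = toy.describe()
# Summary of every column, numeric or not
full_summary = toy.describe(include="all")

print(list(numeric_summary.columns), list(full_summary.columns))
```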


**Step 16.** *Summarize only the occupation column.*

In [41]:
#Only the describe method is called
users.occupation.describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

**Step 17.** *What is the mean age of users?*

In [45]:
#Select the age column and compute its mean
users.age.mean()

34.05196182396607

In [46]:
#Round the mean to the nearest integer
round(users.age.mean())

34

**Step 18.** *What is the age with least occurrence?*

In [51]:
#Count how often each age occurs; tail shows the least frequent ages
users.age.value_counts().tail()

11    1
10    1
73    1
66    1
7     1
Name: age, dtype: int64
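Because `tail()` shows only five rows, it can miss some ages that are tied for the lowest count. One possible way to list every age with the minimum count, sketched on made-up ages:

```python
import pandas as pd

# Hypothetical ages: 7 and 73 each occur once, the others twice
ages = pd.Series([24, 24, 7, 30, 30, 73])

counts = ages.value_counts()
# Keep every age whose count equals the smallest count
rarest = counts[counts == counts.min()].index.sort_values().tolist()
print(rarest)
```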

### ***World Food Facts***


**Step 1.** *Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data*

In [0]:
#Import the pandas library and assign it to the variable pd
import pandas as pd
#Import the numpy library and assign it to the variable np
import numpy as np

**Step 2.** *Download the dataset to your computer and unzip it.*

In [0]:
dataset='en.openfoodfacts.org.products.tsv'

**Step 3.** *Use the tsv file and assign it to a dataframe called food.*

In [58]:
#Create the variable food
food = pd.read_csv(dataset, sep='\t')



**Step 4.** *See the first 5 entries.*

In [60]:
#head(n) shows the first n rows of the DataFrame
food.head(5)

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,...,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,,,Ferme t'y R'nao,ferme-t-y-r-nao,,,,,,,,,,,,,,,,,,en:FR,en:france,France,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,,,,,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Bananas, vegetable oil (coconut oil, corn oil ...",,,,,,...,,0.0214,,,,,,,,,,,,,,0.0,,0.00129,,,,,,,,,,,,,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,,,Torn & Glasser,torn-glasser,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Peanuts, wheat flour, sugar, rice flour, tapio...",,,,,,...,,0.0,,,,,,,,,,,,,,0.071,,0.00129,,,,,,,,,,,,,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,,,Grizzlies,grizzlies,,,,,,,,,,,,,,,,,,US,en:united-states,United States,"Organic hazelnuts, organic cashews, organic wa...",,,,,,...,,,,,,,,,,,,,,,,0.143,,0.00514,,,,,,,,,,,,,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,,,Bob's Red Mill,bob-s-red-mill,,,,,,,,,,,,,,,,,,US,en:united-states,United States,Organic polenta,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


**Step 5.** *What is the number of observations in the dataset?*

In [69]:
#shape[0] is the number of rows (observations)
food.shape[0]

53805

**Step 6.** *What is the number of columns in the dataset?*

In [70]:
#shape[1] is the number of columns
food.shape[1]

163

**Step 7.** *Print the name of all the columns.*

In [72]:
#The columns attribute holds the names of the columns
food.columns

Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name',
       'generic_name', 'quantity',
       ...
       'fruits-vegetables-nuts_100g', 'fruits-vegetables-nuts-estimate_100g',
       'collagen-meat-protein-ratio_100g', 'cocoa_100g', 'chlorophyl_100g',
       'carbon-footprint_100g', 'nutrition-score-fr_100g',
       'nutrition-score-uk_100g', 'glycemic-index_100g',
       'water-hardness_100g'],
      dtype='object', length=163)

**Step 8.** *What is the name of 105th column?*

In [73]:
#The first column begins its count at 0 therefore 104 is required
food.columns[104]

'-glucose_100g'


**Step 9.** *What is the type of the observations of the 105th column?*

In [74]:
#the previous query is called from the dtypes method of the food variable
food.dtypes[food.columns[104]]

dtype('float64')
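Steps 8 and 9 rely on zero-based positions: the n-th column sits at position n - 1. A sketch on a made-up three-column frame (`toy` is hypothetical) looks up the third column's name and type by position:

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the float column
toy = pd.DataFrame({
    "code": [1, 2],
    "product_name": ["Peanuts", "Organic Polenta"],
    "iron_100g": [0.00129, None],
})

# The third column is at position 2
third_name = toy.columns[2]
# dtypes is a Series, so iloc gives positional access to a column's type
third_dtype = toy.dtypes.iloc[2]
print(third_name, third_dtype)
```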

**Step 10.** *How is the dataset indexed?*

In [75]:
#The index attribute shows how the rows are indexed
food.index

RangeIndex(start=0, stop=53805, step=1)

**Step 11.** *What is the product name of the 19th observation?*

In [76]:
#Row label 18 is the 19th observation because the index starts at 0
food.loc[18, 'product_name']

'Lotus Organic Brown Jasmine Rice'
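A single cell can be read by label with `loc` or by position with `iloc`. With the default index the two agree, as this sketch on made-up data shows (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the default 0-based index
toy = pd.DataFrame({"code": [10, 20, 30], "product_name": ["A", "B", "C"]})

# Second observation: row label 1, column 'product_name'
by_label = toy.loc[1, "product_name"]
# Same cell by position: row 1, column 1
by_position = toy.iloc[1, 1]
print(by_label, by_position)
```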

## ***Filtering and Sorting***

## ***Grouping***

## ***Apply***

## ***Merge***

## ***Stats***


## ***Visualization***

## ***Creating Series and DataFrames***

## ***Time Series***

## ***Deleting***