
# Lists

Often we need to store a number of single items of data together so that they can be processed together. This might be because all the data refers to one person (e.g. name, age, gender), or because we have a set of related values (e.g. all the items to display in a drop-down list, such as the years from this year back to 100 years ago, so that someone can select their year of birth).

Python has a range of data structures available including:
*   lists  
*   tuples  
*   dictionaries  
*   sets

This worksheet looks at lists.

## List
A list is an ordered collection of related, individual data objects that are indexed and can be processed as a whole, as subsets, or as individual items.  Lists are stored, essentially, as contiguous items in memory so that access can be as quick as possible.  However, they are mutable (they can be changed after they have been created and stored), so they need extra functionality to deal with changing list sizes.
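
As a small sketch of those ideas (indexing, taking a subset, and changing a list in place), the example below uses a made-up `colours` list that is not part of the worksheet data.

```python
# A made-up list, used only to illustrate the ideas above
colours = ["red", "green", "blue", "yellow"]

first = colours[0]        # index 0 is the first item
last = colours[-1]        # negative indexes count from the end
middle = colours[1:3]     # a slice gives a new list holding a subset

colours[1] = "purple"     # lists are mutable: replace an item in place
colours.append("orange")  # ...or add an item to the end

print(first, last, middle)
print(colours)
```

Because a slice is a separate list, `middle` keeps its original contents even after `colours` is changed.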

# Let's get some lists of data
For this worksheet we are going to work with data on STEAM games.  We are going to get the data from a spreadsheet and make lists that we can find things out from.



## Creating a list
---

```
nums = [1, 2, 3, 4, 5]
names = ["Tom","Jerry","Spike"]
```

## Exercise 1
---
Write a function **print_list()** that will create the two lists `nums` and `names`, then will print them as lists, e.g.

```
print(nums)
```

In [65]:
def print_list():
  nums = [1, 2, 3, 4, 5]
  names = ["Tom","Jerry","Spike"]

  print(nums,names)
print_list()

[1, 2, 3, 4, 5] ['Tom', 'Jerry', 'Spike']


## Exercise 2
---

Write a function called **print_1st_3rd()** that will print the 1st and 3rd items in the names list.

In [66]:
def print_1st_3rd():
  nums = [1, 2, 3, 4, 5]
  names = ["Tom","Jerry","Spike"]

  return names[0], names[2]
print(print_1st_3rd())


('Tom', 'Spike')


## Exercise 3  - Print a subset of a list
---

Write a function **print_first_2()** which will create the nums list, then print the first two items in it.


In [67]:
def print_first_2():
  nums = [1, 2, 3, 4, 5]

  # slice from the start up to (but not including) index 2
  print(nums[:2])
print_first_2()

[1, 2]


# List length

Use the len() function to get the number of items in a list.

There are 5 items in the nums list and 3 in the names list.

Write a function **print_list_info()** that will:
* create both lists
* print the length of the nums list
* print the length of the names list
* concatenate (add) the two lists together to make a new list called num_names
* print the length of the new list

Expected output:
```
The length of the nums list is: 5
The length of the names list is: 3
The length of the joined list is: 8
```
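
A minimal sketch of `len()` and concatenation on two made-up lists (`evens` and `odds` are illustrative names, not part of the exercise). It shows that `+` builds a new list and leaves the original lists as they were.

```python
# Made-up lists, used only for illustration
evens = [2, 4, 6]
odds = [1, 3, 5, 7]

combined = evens + odds   # + builds a new list; evens and odds are unchanged

print(len(evens), len(odds), len(combined))
print(combined)
```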


In [68]:
def print_list_info():
  nums = [1, 2, 3, 4, 5]
  names = ["Tom","Jerry","Spike"]

  print("The length of the nums list is:", len(nums))
  print("The length of the names list is:", len(names))
  # + joins the two lists into a new list
  num_names = nums + names
  print("The length of the joined list is:", len(num_names))
print_list_info()

The length of the nums list is: 5
The length of the names list is: 3
The length of the joined list is: 8


# List methods

You can get an overview of the methods you can use here: https://www.w3schools.com/python/python_lists_methods.asp

Then: 
1.  Create the nums and names list again 
2.  Append the number 6 to the nums list, and print
3.  Insert the name "Sylvester" before "Jerry" in the names list and print
4.  Print the length of the nums list
5.  Remove the number 4 from the nums list, and print
6.  Print the max and min of the nums list
7.  Create a new list called new_nums which contains the numbers 40 to 50 (use the range function)

**Expected output**: 
``` 
[1, 2, 3, 4, 5, 6]
['Tom', 'Sylvester', 'Jerry', 'Spike']
6
[1, 2, 3, 5, 6]
6 1
range(40, 51)
```
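
Note that `range(40, 51)` on its own is a range object rather than a list, which is why the last line of the expected output reads `range(40, 51)`. If an actual list is wanted, one option is to pass the range to `list()`. The variable name `lazy_nums` below is illustrative.

```python
lazy_nums = range(40, 51)    # a range object: 40 up to, but not including, 51
new_nums = list(lazy_nums)   # build a real list from the range

print(lazy_nums)
print(new_nums)
print(len(new_nums), new_nums[0], new_nums[-1])
```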

In [69]:
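def list_stuff():
  nums = [1, 2, 3, 4, 5]
  names = ["Tom","Jerry","Spike"]

  nums.append(6)
  print(nums)

  # insert at index 1, which is the position "Jerry" currently holds
  names.insert(1,"Sylvester")
  print(names)

  print(len(nums))

  # remove() deletes the first item equal to 4
  nums.remove(4)
  print(nums)

  print(max(nums),min(nums))

  new_nums = range(40,51)
  print(new_nums)
list_stuff()

[1, 2, 3, 4, 5, 6]
['Tom', 'Sylvester', 'Jerry', 'Spike']
6
[1, 2, 3, 5, 6]
6 1
range(40, 51)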


# Now some real data
---

1.  Open the STEAM csv file (which we have taken from Kaggle and have reduced to make it more manageable): https://drive.google.com/file/d/1amPnoBi3uhQXjFaQbUy-L-Y-eeJ1BcxE/view?usp=sharing  

2.  Open the file with Google sheets to see what is in it.  The file contains rows of data, each with a user id and a game that the user has purchased.

3.  NOW, run the code in the cell below to get:  
- users (the list of user ids in the data)
- titles (the list of titles that have been purchased)

In [70]:
import pandas as pd

# open the data file and return the User column and the Title column as lists, plus the whole table
def get_users_and_titles():
  url = "https://drive.google.com/uc?id=1rkG8-cp-KLBc1zK4YMLHIsMMyyTVk5Ju"
  data_table = pd.read_csv(url)
  return data_table["User"].tolist(), data_table["Title"].tolist(), data_table

users, titles, df = get_users_and_titles()
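
To see the shape of the two lists without downloading anything, here is a rough sketch that builds similar lists from a few invented rows using the standard `csv` module instead of pandas. The column names `User` and `Title` are taken from the function above; the rows and the names `sample_csv`, `sample_users` and `sample_titles` are made up for illustration.

```python
import csv
import io

# Invented sample rows; only the column names match the cell above
sample_csv = """User,Title
101,Spore
101,Fallout 4
202,Spore
"""

sample_users = []
sample_titles = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    sample_users.append(int(row["User"]))   # csv gives strings, so convert the id
    sample_titles.append(row["Title"])

print(sample_users)
print(sample_titles)
```

The two lists line up by position: item `i` of `sample_users` bought item `i` of `sample_titles`, so a user id appears once for every game that user purchased.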

---
### Exercise 1 - list head, tail and length of the titles list
---

Write a function, **describe_list()** which will:
*  print the length of the list `titles`
*  print the first 10 items in `titles` (the head)  
*  print the last 5 items in `titles` (the tail)

Expected output:  
```
129511
['The Elder Scrolls V Skyrim', 'Fallout 4', 'Spore', 'Fallout New Vegas', 'Left 4 Dead 2', 'HuniePop', 'Path of Exile', 'Poly Bridge', 'Left 4 Dead', 'Team Fortress 2']
['Fallen Earth', 'Magic Duels', 'Titan Souls', 'Grand Theft Auto Vice City', 'RUSH']
```
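
Slicing is one way to take the head and tail of a list. The sketch below uses a made-up `letters` list; the same slices should work on any list.

```python
# Made-up list, used only for illustration
letters = ["a", "b", "c", "d", "e", "f", "g"]

head = letters[:3]    # the first 3 items
tail = letters[-2:]   # the last 2 items

print(len(letters))
print(head)
print(tail)
```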

In [71]:
def describe_list():
  print(len(titles))
  # the head: the first 10 items
  print(titles[:10])
  # the tail: the last 5 items
  print(titles[-5:])
describe_list()

129511
['The Elder Scrolls V Skyrim', 'Fallout 4', 'Spore', 'Fallout New Vegas', 'Left 4 Dead 2', 'HuniePop', 'Path of Exile', 'Poly Bridge', 'Left 4 Dead', 'Team Fortress 2']
['Fallen Earth', 'Magic Duels', 'Titan Souls', 'Grand Theft Auto Vice City', 'RUSH']


---
### Exercise 2 - use a loop to print the first 20 items

Write a function which will:
*  create a new list from the first 20 items of the titles list
*  loop through the new list and print each item


In [72]:
def print_first_20():
  # slice off the first 20 items to make a new list
  titles_first20 = titles[:20]
  # loop through the new list and print each item on its own line
  for title in titles_first20:
    print(title)
print_first_20()

The Elder Scrolls V Skyrim
Fallout 4
Spore
Fallout New Vegas
Left 4 Dead 2
HuniePop
Path of Exile
Poly Bridge
Left 4 Dead
Team Fortress 2
Tomb Raider
The Banner Saga
Dead Island Epidemic
BioShock Infinite
Dragon Age Origins - Ultimate Edition
Fallout 3 - Game of the Year Edition
SEGA Genesis & Mega Drive Classics
Grand Theft Auto IV
Realm of the Mad God
Marvel Heroes 2015


---
### Exercise 3 - count the number of times a title appears in the list

Write a function which will:
*  count the number of times that the title Fallout 4 appears in the list

Expected output:  
168

In [73]:
def count_title():
  print(titles.count("Fallout 4"))


count_title()

168


---
### Exercise 4 - remove all duplicates of a title from the list

Write a function which will remove the duplicate occurrences of Fallout 4 from the titles list, so that only one remains (Hint:  you can remove an occurrence of Fallout 4 repeatedly until there is only one left).  This will require a while loop.
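
It helps to know that `remove()` deletes only the first matching item, so the copy that survives a repeated removal is the last one in the list. A small sketch on a made-up `pets` list:

```python
# Made-up list, used only for illustration
pets = ["cat", "dog", "cat", "fish", "cat"]

remaining = pets.copy()            # work on a copy so pets is left unchanged
while remaining.count("cat") > 1:
    remaining.remove("cat")        # remove() deletes the first match only

print(remaining)
```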


In [74]:
def remove_duplicates():
  # work on a copy so that titles is unchanged for the exercises that follow
  titles_deduped = titles.copy()
  # remove() deletes the first match only, so repeat until one is left
  while titles_deduped.count("Fallout 4") > 1:
    titles_deduped.remove("Fallout 4")
  return titles_deduped

titles_deduped = remove_duplicates()


---
### Exercise 5 - print the counts of the first 10 titles in the list

Write a function which will:
* loop through the first 10 items in the titles list
* for each item print the number of times that title appears in the list
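
Calling `count()` once per title scans the whole list each time. On a long list, one alternative is `collections.Counter` from the standard library, which tallies every item in a single pass. The sketch below uses a made-up `fruit` list and is only one possible approach.

```python
from collections import Counter

# Made-up list, used only for illustration
fruit = ["apple", "pear", "apple", "plum", "pear", "apple"]

tally = Counter(fruit)        # counts every distinct item in one pass

for name in fruit[:3]:        # the first 3 items, duplicates included
    print(name, tally[name])
```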


In [75]:
def print_count_of_first_ten():
  # count each of the first 10 titles across the whole titles list
  for title in titles[:10]:
    print(title, ",  ", titles.count(title))
print_count_of_first_ten()


---
### Project - work as a team

The users list has the ids of all the users who have purchased STEAM games.

Write a function that will, for the first 100 users:
* count how many games have been purchased by each user.  
* calculate the percentage of all purchases made by each user
* calculate the percentage of all purchases made by these 100 users altogether
* find the id of the user who has purchased the most games of these 100 users 
* calculate the average number of games purchased by a user from the 100 
* print this information, printing each unique user just once  
Do the same with the last 100 users  

Divide up the tasks and each write one part, then try to get them all to work together.
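
One possible way to split the work is sketched below on a tiny made-up list, where a group of 3 users stands in for the 100. All names here (`sample_users`, `counts`, `group` and so on) are illustrative, and percentages are taken against every purchase in the sample list.

```python
from collections import Counter

# Made-up purchase records: one user id per purchase
sample_users = [7, 7, 3, 9, 3, 7, 5]

counts = Counter(sample_users)                   # purchases per user
unique_ids = list(dict.fromkeys(sample_users))   # unique ids, in first-seen order
group = unique_ids[:3]                           # stand-in for "the first 100 users"

# per-user count and percentage of all purchases
for user in group:
    share = round(counts[user] * 100 / len(sample_users), 2)
    print(user, counts[user], share)

# figures for the group as a whole
group_total = sum(counts[user] for user in group)
group_share = round(group_total * 100 / len(sample_users), 2)
top_user = max(group, key=lambda user: counts[user])
average = group_total / len(group)
print(top_user, group_share, average)
```

`dict.fromkeys()` is used here because it keeps the ids in the order they first appear, so "first" and "last" refer to positions in the original data.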

In [98]:
# Unique user ids. A set does not keep the original order of the data,
# so the "first" and "last" 100 ids are simply positions in this list.
users_set = set(users)
users_list = list(users_set)
users_list_top100 = users_list[:100]
users_list_bottom100 = users_list[-100:]

def print_100_users(user_ids):
  # count each user's purchases once and reuse the result
  user_count_dict = {}
  total_bought = 0
  for user in user_ids:
    user_count_dict[user] = users.count(user)
    total_bought += user_count_dict[user]

  # each user's share of the purchases made by this group
  dictionary_list = []
  for user, user_bought_count in user_count_dict.items():
    new_dictionary = {"id":user,"Games Bought: ": user_bought_count, "Percentage(%): ":round((user_bought_count*100)/total_bought,2) }
    dictionary_list.append(new_dictionary)

  # find the user with the most purchases
  max_key = None
  max_value = 0
  for k,v in user_count_dict.items():
    if v > max_value:
      max_value = v
      max_key = k
  # average over the number of users passed in, rather than a fixed 100
  average_bought = total_bought/len(user_ids)
  for i in user_count_dict.items():
    print(i)
  print("id with most games purchased: ",max_key, " with ", max_value," games.")
  print("average games purchased per user: ",average_bought,".")
print_100_users(users_list_bottom100)

(305889009, 1)
(220561138, 21)
(185401083, 1)
(43908860, 17)
(29753085, 88)
(130678525, 3)
(92929791, 6)
(16154372, 119)
(241303303, 2)
(35290890, 6)
(140050187, 6)
(52559629, 1)
(122552077, 1)
(302186258, 1)
(188284694, 1)
(622362, 10)
(38436635, 6)
(195723035, 1)
(117210908, 1)
(93454114, 1)
(181010210, 35)
(231472933, 1)
(141721383, 43)
(146145063, 12)
(205881130, 1)
(256573229, 1)
(47021871, 1)
(193265457, 2)
(192282418, 1)
(115736372, 1)
(159416133, 1)
(222986054, 1)
(256606025, 3)
(102465357, 2)
(162692942, 1)
(224526163, 3)
(204537690, 3)
(147619675, 1)
(102825821, 136)
(298614625, 1)
(141721441, 2)
(111804259, 1)
(75366244, 7)
(169344872, 1)
(198639468, 1)
(205029231, 2)
(108887922, 1)
(293076852, 1)
(31063925, 2)
(60227446, 5)
(208994172, 2)
(260538236, 7)
(16285568, 4)
(129466241, 2)
(154730370, 61)
(216924034, 1)
(109969285, 1)
(300089223, 1)
(212828039, 1)
(257064839, 1)
(141918091, 2)
(95780749, 34)
(189792144, 3)
(17530772, 307)
(82542485, 1)
(233209749, 3)
(307232663, 1)