

As a member of the analytical team, your first step is to assess the quality of a sample of collected data and prepare it for future analysis. Later, in the second part of this project in the second sprint, you will develop your skills further and carry out your first complete analysis to address the client's needs.

This is the data the client provided. It is formatted as a Python list, where each row holds the following fields:

- **user_id:** Unique identifier for each user.
- **user_name:** The name of the user.
- **user_age:** Age of the user.
- **fav_categories:** Categories of items purchased by the user, such as 'ELECTRONICS', 'SPORT', 'BOOKS', etc.
- **total_spendings:** List of integers indicating the amount the user spent in each of their favorite categories.

In [None]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]


# Step 1

Store 1 aims to ensure consistency in data collection. As part of this effort, the quality of the data collected on users needs to be evaluated. You have been asked to review the collected data and propose changes. Below you will see data about a particular user. Please review the data and identify any potential issues.

In [None]:
user_id = '32415'
user_name = ' mike_reed '
user_age = 32.0
fav_categories = ['ELECTRONICS', 'SPORT', 'BOOKS']

**Options:**

1. The data type for `user_id` should be changed from a string to an integer.
    
2. The `user_name` variable contains a string that has unnecessary spacing and an underscore between the first and last names.
    
3. The data type of `user_age` is correct and there is no need to convert it.
    
4. The `fav_categories` list contains strings in upper case. We should convert the values in the list to lower case instead.
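
One way to check the options is to look at each value's type and exact representation. The snippet below is only a sketch to support your reasoning, not the expected answer; the loop and the three boolean names at the end are illustrative and not part of the client's data. `repr()` (via `!r`) prints the quotes around a string, which makes hidden spaces visible.

```python
user_id = '32415'
user_name = ' mike_reed '
user_age = 32.0
fav_categories = ['ELECTRONICS', 'SPORT', 'BOOKS']

# Print the type and the exact representation of each variable.
for label, value in [('user_id', user_id), ('user_name', user_name),
                     ('user_age', user_age), ('fav_categories', fav_categories)]:
    print(f'{label}: {type(value).__name__} -> {value!r}')

# Illustrative checks that mirror the options above.
has_extra_spaces = user_name != user_name.strip()   # leading/trailing spaces?
has_underscore = '_' in user_name                   # underscore separator?
age_is_whole_number = user_age.is_integer()         # float with no fractional part?
```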

For each of the options, write in the markdown cell below whether you have identified it as a real issue in the data or not. Justify your reasoning. For example, if you believe the first option is correct, write it down and explain why you think it is correct.

**Write your answer and explain your reasoning:**

# Step 2

Let's implement the changes we identified. First, we want to correct the issues with the `user_name` variable. As we found, it has unnecessary spaces and an underscore separating the first and last names. Your goal is to remove the spaces and then replace the underscore with a space.

In [None]:
user_name = ' mike_reed '
user_name = user_name.strip()
user_name = user_name.replace('_', ' ')

print(user_name)

mike reed


# Step 3

Next, we need to split the updated `user_name` into two substrings to obtain a list that contains two values: the string for the first name and the string for the last name.

In [None]:
user_name = 'mike reed'
name_split = user_name.split()

print(name_split)

['mike', 'reed']
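
A detail worth knowing about `split()`: called with no arguments, it splits on any run of whitespace and ignores whitespace at the ends of the string. The sketch below, which uses the illustrative names `raw_name` and `name_parts`, shows that the raw name can be split directly once the underscore is replaced, and how `split(' ')` behaves differently.

```python
raw_name = ' mike_reed '

# split() with no arguments ignores leading/trailing whitespace,
# so a separate strip() is not strictly needed before splitting.
name_parts = raw_name.replace('_', ' ').split()
print(name_parts)  # ['mike', 'reed']

# With an explicit separator, the extra spaces produce empty strings.
print(raw_name.replace('_', ' ').split(' '))  # ['', 'mike', 'reed', '']
```

Keeping the explicit `strip()` from Step 2 is still a reasonable choice when the cleaned name is also needed as a single string.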


# Step 4

Great! Now we want to work with the `user_age` variable. As we mentioned earlier, it has an incorrect data type: it is stored as a float, while an age should be an integer. Let's fix this by converting it to an integer and printing the result.

In [None]:
user_age = 32.0
user_age = int(user_age)

print(user_age)

32
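
Note that `int()` truncates a float toward zero rather than rounding it, so a value such as `32.9` would silently become `32`. If that matters, one option is to confirm the float is a whole number first. The helper `age_to_int` below is hypothetical and only sketches the idea.

```python
def age_to_int(age):
    """Convert a float age to int only if it has no fractional part.

    Hypothetical helper for illustration; not part of the project template.
    """
    if not age.is_integer():
        raise ValueError(f'age {age} is not a whole number')
    return int(age)


print(age_to_int(32.0))  # 32
print(int(32.9))         # 32, because int() drops the fractional part
```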


# Step 5

As we all know, data is not always perfect. We have to consider scenarios where the `user_age` value cannot be converted to an integer. To prevent our system from crashing, we must handle that case in advance.

Write code that attempts to convert the `user_age` variable to an integer and assigns the converted value to `user_age_int`. If the attempt fails, print a message asking the user to provide their age as a numerical value: `Please provide your age as a numerical value.`

In [None]:
user_age = 'thirty two'

try:
    user_age_int = int(user_age)
except ValueError:  # int() raises ValueError for non-numeric strings
    print('Please provide your age as a numerical value.')

Please provide your age as a numerical value.
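
The same pattern can be wrapped in a function so it can be reused for many values. This is a minimal sketch: `parse_age` is a hypothetical helper, and returning `None` on failure is an assumption about how the caller wants to handle bad input. It also catches `TypeError`, which `int()` raises for values such as `None`.

```python
def parse_age(value):
    """Return the age as an int, or None if it cannot be converted.

    Hypothetical helper for illustration; not part of the project template.
    """
    try:
        return int(value)
    except (ValueError, TypeError):
        return None


for candidate in ['thirty two', '32', 32.0, None]:
    age = parse_age(candidate)
    if age is None:
        print('Please provide your age as a numerical value.')
    else:
        print(age)
```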


# Step 6

The management team of Store 1 has asked you to help them organize their customer data for better analysis and management.

Your task is to sort this list by user ID in ascending order to facilitate easier access and analysis.

In [None]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]
users.sort()

print(users)

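[['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]], ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]], ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]], ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]], ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]], ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]], ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]], ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]], ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]], ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]]]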
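
Calling `users.sort()` works here because Python compares lists element by element, the user ID is the first element of each row, and all IDs are strings of the same length. A more explicit alternative is to pass a `key` function. The sketch below uses a shortened, illustrative `sample_users` list (only the first three fields of three rows) to keep the example small.

```python
# Shortened rows for illustration: [user_id, user_name, user_age]
sample_users = [
    ['32415', ' mike_reed ', 32.0],
    ['31980', 'kate morgan', 24.0],
    ['32156', ' john doe ', 37.0],
]

# Sort by the numeric value of the ID (index 0); sorted() returns a new list.
users_by_id = sorted(sample_users, key=lambda user: int(user[0]))
print([user[0] for user in users_by_id])  # ['31980', '32156', '32415']

# Why the numeric key can matter: as strings, '9' sorts after '10'.
print(sorted(['9', '10']))           # ['10', '9']
print(sorted(['9', '10'], key=int))  # ['9', '10']
```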


# Step 7

We have information about our user's spending habits, including the amount spent in each of their favorite categories. Management wants to know the total amount spent by the user.


Let's calculate this value and print it:

In [None]:
fav_categories_low = ['electronics', 'sport', 'books']
spendings_per_category = [894, 213, 173]

total_amount = spendings_per_category[0] + spendings_per_category[1] + spendings_per_category[2]


print(total_amount)


1280
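
Adding the elements by index works when the list always has exactly three values, but some users in the data have only two categories. As a sketch of a more general approach, the built-in `sum()` adds up a list of any length:

```python
spendings_per_category = [894, 213, 173]

# sum() adds every element, regardless of how many categories there are.
total_amount = sum(spendings_per_category)
print(total_amount)  # 1280

# The same call works for a user with only two categories.
print(sum([439, 390]))  # 829
```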


# Step 8

The management of the company asked us to come up with a way to summarize all of the information about a user. Your goal is to create a formatted string that uses information from the `user_id`, `user_name` and `user_age` variables.

Here is the final string that we want to create: `User 32415 is mike who is 32 years old.`

In [None]:
user_id = '32415'
user_name = ['mike', 'reed']
user_age = 32

user_info = f'User {user_id} is {user_name[0]} who is {user_age} years old.'
print(user_info)

User 32415 is mike who is 32 years old.


# Step 9

Management also wants an easy way to know how many clients we have data on. Your goal is to create a formatted string that outputs the number of clients registered.

Here is the final string that we want to create: `We have registered data on X clients.`

In [None]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]


user_info = f'We have registered data on {len(users)} clients.'
print(user_info)

We have registered data on 10 clients.


# Step 10

Now let's apply all the changes to the client list. For the sake of simplicity, we provide a shorter list.
You should:
1. Remove all leading and trailing spaces from the names, and replace any underscores with spaces.
2. Convert all the ages to integers.
3. Separate the first and last names into a sub-list.

Save the altered list into a new list called `users_clean`, and then print the new list.

In [None]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
]

users_clean = []


# Process the first user
user_name_1 = users[0][1].strip().replace('_', ' ')
user_age_1 = int(users[0][2])
user_name_1 = user_name_1.split()
users_clean.append([users[0][0], user_name_1, user_age_1, users[0][3], users[0][4]])

# Process the second user
user_name_2 = users[1][1].strip().replace('_', ' ')
user_age_2 = int(users[1][2])
user_name_2 = user_name_2.split()
users_clean.append([users[1][0], user_name_2, user_age_2, users[1][3], users[1][4]])

# Process the third user
user_name_3 = users[2][1].strip().replace('_', ' ')
user_age_3 = int(users[2][2])
user_name_3 = user_name_3.split()
users_clean.append([users[2][0], user_name_3, user_age_3, users[2][3], users[2][4]])



print(users_clean)


[['32415', ['mike', 'reed'], 32, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]], ['31980', ['kate', 'morgan'], 24, ['CLOTHES', 'BOOKS'], [439, 390]], ['32156', ['john', 'doe'], 37, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]]]
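
The three nearly identical blocks above can also be expressed as a single loop, which scales to a list of any length. This is a sketch of one possible approach rather than the required solution; the loop variable names (`raw_name`, `raw_age`, `categories`, `spendings`, `name_parts`) are illustrative.

```python
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
]

users_clean = []

# Unpack each row into its five fields, clean them, and store the result.
for user_id, raw_name, raw_age, categories, spendings in users:
    # Trim spaces, replace the underscore, then split into [first, last].
    name_parts = raw_name.strip().replace('_', ' ').split()
    users_clean.append([user_id, name_parts, int(raw_age), categories, spendings])

print(users_clean)
```

The printed result should match the output shown above for the step-by-step version.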
