# Individual Assignment: Wine!

There are 10 questions in this assignment. Some questions unlock others. If you can't answer a question, you can skip it and come back to it later, and if the question is locked, I'll provide a data sample for you to use, **at a cost of 0.5 points.**

The wine dataset is available in the file `wine.json`. This data contains information about wine reviews. It's a list of dictionaries, where each dictionary represents a wine review. The keys in the dictionary are:

* `points`: how many points the taster gave the wine on a scale of 1-100
* `title`: the title of the wine
* `description`: a description of the wine
* `taster_name`: the name of the taster
* `taster_twitter_handle`: the twitter handle of the taster
* `price`: the cost for a bottle of the wine
* `designation`: the vineyard within the winery where the grapes that made the wine are from
* `variety`: the type of grapes used to make the wine
* `region_1`: the province or state that the wine is from
* `region_2`: a more specific region within a wine growing area
* `province`: the province or state that the wine is from
* `country`: the country that the wine is from
* `winery`: the winery that made the wine



### Rules:
* When I ask a question, print the answer in the cell below the question. You can use `print()` or just type the variable name.
* You can use any resources you like --including the internet, your notes, and the past notebooks-- except ChatGPT/Copilot/other AI writing tools. Using AI-based tools or asking other people for help will result in a 0 for the assignment, an immediate Fail in the course, and a report to the Dean of Students.
* You have 80 minutes to complete the assignment.
* You can submit the assignment as many times as you like, only the last submission will be graded.
* You can't work with other people on the assignment.

### Question 1 (1 point)

Read the JSON data into a variable called `wine`. Remember to make sure that Python is looking in the right place for the file. You can check the current working directory with the following code:

```python
import os
os.getcwd()
```

If the file is not in the current working directory, you can change the working directory with the following code:

```python
os.chdir('path/to/file')
```

How many reviews are in the data?

**HELP: If you can't figure out how to read the JSON file, raise your hand and I'll tell you how to do it. (-0.5 points)**

In [None]:
import json

with open("wine-data-set.json") as f:
    wines = json.load(f)

c:\Users\SABIO\OneDrive\Documents\GitHub\IE-University\IE_MASTERS\7_PYTHON_FOR_DATA_ANALYSIS\individual_assignment_wine_2023_sept


FileNotFoundError: [Errno 2] No such file or directory: '/7_PYTHON_FOR_DATA_ANALYSIS/wine-data-set.json'

In [9]:
reviews = len(wines)

print(reviews)

129971


### Question 2 (1 point)

Create a list called `prices` containing all the prices in the dataset. What is the average price of a bottle of wine in the dataset?

*Hint*: given a list of numbers, you can calculate the mean value of all numbers in the list with the following code:

```python
prices = [1, 2, 3, 4, 5]

sum(prices) / len(prices)
```

In [10]:
prices = [wine['price'] for wine in wines if wine['price'] != None]

avg_price = sum(prices) / len(prices)

print(f"The average price of a wine is {avg_price:.2f}")

The average price of a wine is 35.36


### Question 3 (1 point)

Build a dictionary with the following structure:
```python
ratings = {
    country: average points of all its wines
}
```

What is the `country` whose wines have the highest average `points`?

Hints:
* The elements in the JSON file might not be numbers, so make sure you're only calculating the average of numbers, not strings.
* Remember to use the unique countries in the dataset, so as not to double count.
* Printing the countries and finding the highest average is not fully correct. You have to do it with code.

In [11]:
# average points per country
countries = {wine['country'] for wine in wines}

ratings = {}

for country in countries:
    ratings_country = []
    for wine in wines:
        if wine['country'] == country:
            ratings_country.append(float(wine['points']))

    ratings[country] = sum(ratings_country) / len(ratings_country)

# extract the country with the highest average rating
country_with_highest_rating = max(ratings, key=ratings.get)

print(f"The country with the highest average rating is {country_with_highest_rating}")


The country with the highest average rating is England


### Question 4 (1 point)

Using the `ratings` dictionary created in the previous answer, what are the average ratings of the following countries:

* `Egypt`
* `Slovenia`
* `Uruguay`

**HELP: If you couldn't create the `ratings` dictionary in Q3, use the following dictionary to solve it. (-0.5 points)**
```python
ratings_dictionary = {'England': 91.58108108108108, 'India': 90.22222222222223, 'Austria': 90.10134529147982, 'Germany': 89.85173210161663, 'Canada': 89.36964980544747, 'Hungary': 89.1917808219178, 'China': 89.0, 'France': 88.84510931064138, 'Luxembourg': 88.66666666666667, None: 88.63492063492063, 'Australia': 88.58050665521684, 'Switzerland': 88.57142857142857, 'Morocco': 88.57142857142857, 'US': 88.56372009393806, 'Italy': 88.56223132036847, 'Israel': 88.47128712871287, 'New Zealand': 88.3030303030303, 'Portugal': 88.25021964505359, 'Turkey': 88.08888888888889, 'Slovenia': 88.06896551724138, 'South Africa': 88.05638829407566, 'Bulgaria': 87.93617021276596, 'Georgia': 87.68604651162791, 'Lebanon': 87.68571428571428, 'Armenia': 87.5, 'Serbia': 87.5, 'Spain': 87.28833709556058, 'Greece': 87.28326180257511, 'Czech Republic': 87.25, 'Croatia': 87.21917808219177, 'Moldova': 87.20338983050847, 'Cyprus': 87.18181818181819, 'Slovakia': 87.0, 'Macedonia': 86.83333333333333, 'Uruguay': 86.75229357798165, 'Argentina': 86.71026315789474, 'Bosnia and Herzegovina': 86.5, 'Chile': 86.4935152057245, 'Romania': 86.4, 'Mexico': 85.25714285714285, 'Brazil': 84.67307692307692, 'Ukraine': 84.07142857142857, 'Egypt': 84.0, 'Peru': 83.5625}
```

In [12]:
print(ratings["Egypt"])
print(ratings["Slovenia"])
print(ratings["Uruguay"])

84.0
88.06896551724138
86.75229357798165


### Question 5 (1 point)

Some data preparation: 
* If there is a wine that doesn't have a price, fill the price of that wine with the average price of all the wines in that country

For example:
* Country B has 3 wines with `[20, 10, None]` as prices
* Then calculate the average price of the wines with prices in that country, and substitute the `None`s with that average (in this case, the average of `[20, 10]`).
* Final prices for the three wines in country B should be `[20, 10, 15]`.

In [13]:
# average price of wines per country
unique_countries = {wine['country'] for wine in wines}

countries = {}

for country in unique_countries:
    prices_country = []
    for wine in wines:
        if wine['country'] == country and wine['price'] != None:
            ratings_country.append(float(wine['points']))

    countries[country] = sum(ratings_country) / len(ratings_country)

# filling in missing values
for wine in wines:
    if wine['price'] == None:
        wine['price'] = countries[wine['country']]


### Question 6 (1 point)

Similar to the `ratings` dictionary, create a new dictionary `countries` where the key is the `country` and the value is a tuple containing the following:
```python
countries = {
    country: (
        average points of all its wines,
        average price of all its wines,
        ratio of average points to average price
    )
}
```

For example (not real numbers):
```python
countries = {
    'France': (90, 20, 4.5),
    'Italy': (88, 15, 5.8),
    'Spain': (85, 10, 8.5)
}
```

We want lots of points on average at a lower price.

What is the `country` whose wines have the highest average `points` to `price` ratio?

In [14]:
# average price per country
unique_countries = {wine['country'] for wine in wines}

countries = {}

for country in unique_countries:

    ratings_country = []
    prices_country = []

    for wine in wines:
        if wine['country'] == country:
            ratings_country.append(float(wine['points']))
            prices_country.append(float(wine['price']))

    avg_points = sum(ratings_country) / len(ratings_country)
    avg_price = sum(prices_country) / len(prices_country)
    points_to_price_ratio = avg_points / avg_price

    countries[country] = (avg_points, avg_price, points_to_price_ratio)

# country with the best ratio of points to price
country_with_best_ratio = max(countries, key=lambda x: countries[x][2])

print(f"The country with the best ratio of points to price is {country_with_best_ratio}, with a ratio of {countries[country_with_best_ratio][2]:.2f}")
print(f"The average price of a wine from {country_with_best_ratio} is {countries[country_with_best_ratio][1]:.2f}")
print(f"The average rating of a wine from {country_with_best_ratio} is {countries[country_with_best_ratio][0]:.2f}")

The country with the best ratio of points to price is Ukraine, with a ratio of 9.12
The average price of a wine from Ukraine is 9.21
The average rating of a wine from Ukraine is 84.07


### Question 7 (1 point)

Using the `countries` dictionary created in the previous answer, which is the country with the lowest points-to-price ratios out of the following countries:

* `Cyprus`
* `Brazil`
* `India`

Again, you need to find the solution via code, not by printing the countries and finding the lowest ratio.

**HELP: If you couldn't create the `countries` dictionary, use the following dictionary to solve it. (-0.5 points)**
```python
{'Switzerland': (88.57142857142857, 85.28571428571429, 1.0385259631490786), 'Ukraine': (84.07142857142857, 9.214285714285714, 9.124031007751938), 'France': (88.84510931064138, 50.44096557625022, 1.761368131946972), 'New Zealand': (88.3030303030303, 28.71683227387118, 3.0749572049203806), 'Uruguay': (86.75229357798165, 26.40366972477064, 3.2856150104239057), 'Slovenia': (88.06896551724138, 29.952709646822846, 2.9402670594973386), 'Macedonia': (86.83333333333333, 15.583333333333334, 5.572192513368983), 'Morocco': (88.57142857142857, 19.5, 4.542124542124542), 'Germany': (89.85173210161663, 43.22511895400462, 2.0786925351721273), 'Moldova': (87.20338983050847, 16.74576271186441, 5.20748987854251), 'Cyprus': (87.18181818181819, 16.272727272727273, 5.357541899441341), 'Argentina': (86.71026315789474, 25.250984316518174, 3.4339359634852884), 'Lebanon': (87.68571428571428, 30.685714285714287, 2.8575418994413404), 'Austria': (90.10134529147982, 40.21345058732233, 2.2405773186717064), 'Canada': (89.36964980544747, 36.330749534908094, 2.4598900641886674), 'Egypt': (84.0, 88.66754349046016, 0.9473590526282896), 'Chile': (86.4935152057245, 21.632842067954186, 3.9982502037423777), 'China': (89.0, 18.0, 4.944444444444445), 'India': (90.22222222222223, 13.333333333333334, 6.766666666666667), 'Czech Republic': (87.25, 24.25, 3.597938144329897), 'Brazil': (84.67307692307692, 29.977986926550436, 2.8245084344901423), 'Hungary': (89.1917808219178, 40.97516808122459, 2.1767276377027667), 'Luxembourg': (88.66666666666667, 23.333333333333332, 3.8000000000000003), 'Greece': (87.28326180257511, 23.072528919739334, 3.7829950113488127), 'Turkey': (88.08888888888889, 24.633333333333333, 3.576003608479928), 'Croatia': (87.21917808219177, 27.174160416353985, 3.209636535070332), 'Australia': (88.58050665521684, 36.233156357592215, 2.4447361356266684), 'Mexico': (85.25714285714285, 26.785714285714285, 3.182933333333333), 'Armenia': (87.5, 14.5, 6.0344827586206895), 'South Africa': (88.05638829407566, 29.577821794179147, 2.977108622359912), 'Israel': (88.47128712871287, 33.5615655251901, 2.6360894000105426), 'Slovakia': (87.0, 16.0, 5.4375), 'Serbia': (87.5, 24.5, 3.5714285714285716), 'Bulgaria': (87.93617021276596, 14.645390070921986, 6.004358353510896), 'Italy': (88.56223132036847, 46.217621803872724, 1.9162005283652987), 'Portugal': (88.25021964505359, 35.137228194299844, 2.511587401176113), 'Peru': (83.5625, 18.0625, 4.626297577854671), 'Romania': (86.4, 15.241666666666667, 5.6686714051394205), None: (88.63492063492063, 28.645492778168947, 3.094201287487374), 'US': (88.56372009393806, 36.80110724188513, 2.4065504201226693), 'Georgia': (87.68604651162791, 20.92993160772165, 4.189504684252194), 'Spain': (87.28833709556058, 28.86762804139494, 3.0237446932041956), 'Bosnia and Herzegovina': (86.5, 12.5, 6.92), 'England': (91.58108108108108, 54.16377107059393, 1.6908180370550563)}
```


In [15]:
lowest_ratio = 100000000

for country in ["Cyprus", "Brazil", "India"]:
    if countries[country][2] < lowest_ratio:
        lowest_ratio = countries[country][2]
        lowest_ratio_country = country

print(f"Out of {', '.join(['Cyprus', 'Brazil', 'India'])}, the country with the lowest ratio of points to price is {lowest_ratio_country}, with a ratio of {lowest_ratio:.2f}")

Out of Cyprus, Brazil, India, the country with the lowest ratio of points to price is Brazil, with a ratio of 2.84


### Question 8 (1 point)

Create a list called `top_wines` that contains all the wines that have achieved the maximum rating (0.1 points)

* First calculate what's the maximum rating and then extract all the wine reviews that have that rating.
* The result should be a list of dictionaries, where each dictionary is a wine review, something like the following.

```python
top_wines = [
    {
        'country': 'France',
        'description': 'This is a top wine',
        'points': 100,
        'price': 100,
        'title': 'Top Wine A',
        'variety': 'Top Variety A',
        'winery': 'Top Winery A'
    },
    {
        'country': 'Italy',
        'description': 'This is another top wine',
        'points': 100,
        'price': 100,
        'title': 'Second Top Wine B',
        'variety': 'Top Variety B',
        'winery': 'Top Winery B'
    }
]
```

* What is the country with the most top wines? (0.3 points)
* What is the average price of a top wine? (0.3 points)
* What is the most present variety of top wine? (0.3 points)



In [16]:
highest_rating = max([float(wine['points']) for wine in wines])

top_wines = [wine for wine in wines if float(wine['points']) == highest_rating]

# 

# country with the most wines in the top
countries = {}

for wine in top_wines:
    if wine['country'] in countries:
        countries[wine['country']] += 1
    else:
        countries[wine['country']] = 1

country_with_most_top_wines = max(countries, key=countries.get)
print(f"The country with the most wines in the top is {country_with_most_top_wines}, with {countries[country_with_most_top_wines]} wines in the top")

# average price of wines in the top
prices = []

for wine in top_wines:

    if wine['price'] != None:
        prices.append(float(wine['price']))

avg_price = sum(prices) / len(prices)

print(f"The average price of a wine in the top is {avg_price:.2f}")

# most present variety in the top

varieties = {}

for wine in top_wines:
    if wine['variety'] in varieties:
        varieties[wine['variety']] += 1
    else:
        varieties[wine['variety']] = 1

most_present_variety = max(varieties, key=varieties.get)

print(f"The most present variety in the top is {most_present_variety}, with {varieties[most_present_variety]} wines in the top")

The country with the most wines in the top is France, with 8 wines in the top
The average price of a wine in the top is 485.95
The most present variety in the top is Bordeaux-style Red Blend, with 5 wines in the top


In [17]:
len(top_wines)

19

### Question 9 (1 point)

* Create a function called `affordable_wines` that receives the wines reviews list and a specific budget, and returns how many wines you can buy with that price. (0.5 points) 
* Create another function called `twitter_presence` that receives the wines reviews list and a wine name and returns True if the wine has a twitter handle for the taster, and False otherwise. (0.5 points)

Prove your functions with these examples:

* `affordable_wines(wines, 10)` should return 6280 wines in that budget
* `twitter_presence(wines, "Nicosia 2013 Vulkà Bianco  (Etna)")` should return `True``, meaning there is a twitter handle for the taster of that wine

In [18]:
def affordable_wines(wines, budget):
    on_bugdet = 0
    for wine in wines:
        if wine['price'] != None and float(wine['price']) <= budget:
            on_bugdet += 1  
    return on_bugdet

affordable_wines(wines, 10)

6280

In [19]:
def twitter_presence(wines, wine_name):
    for wine in wines:
        if wine['taster_twitter_handle'] != "":
            return True
    return False

twitter_presence(wines, "Nicosia 2013 Vulkà Bianco  (Etna)")

True

### Question 10 (1 point)

* Which is the most common variety of wine in the dataset? (0.3 points)
* Which is the most expensive wine in the dataset? (0.3 points)
* Which is, on average, the most expensive variety of wine in the dataset? (0.2 points)
* Which is the taster (other than `None`) that has reviewed the most wines? (0.2 points)

In [20]:
# 1 most common variety

unique_varieties = {wine['variety'] for wine in wines}

varieties = {}

for variety in unique_varieties:
    varieties[variety] = 0

    for wine in wines:
        if wine['variety'] == variety:
            varieties[variety] += 1

most_common_variety = max(varieties, key=varieties.get)

print(f"The most common variety is {most_common_variety}, with {varieties[most_common_variety]} wines out of {len(wines)} reviews")

The most common variety is Pinot Noir, with 13272 wines out of 129971 reviews


In [21]:
# 2 most expensive wine in the data

most_expensive_wine = ""
highest_price = 0

for wine in wines:
    if wine['price'] != None and float(wine['price']) > highest_price:
        highest_price = float(wine['price'])
        most_expensive_wine = wine['title']

print(f"The most expensive wine is {most_expensive_wine}, with a price of {highest_price}")

The most expensive wine is Château les Ormes Sorbet 2013  Médoc, with a price of 3300.0


In [22]:
# 3 most expensive variety of wine on average

unique_varieties = {wine['variety'] for wine in wines}

varieties = {}

for variety in unique_varieties:

    prices = []

    for wine in wines:
        if wine['variety'] == variety and wine['price'] != None:
            prices.append(float(wine['price']))

    avg_price = sum(prices) / len(prices)

    varieties[variety] = avg_price

most_expensive_variety = max(varieties, key=varieties.get)

print(f"The most expensive variety is {most_expensive_variety}, with an average price of {varieties[most_expensive_variety]:.2f}")

The most expensive variety is Ramisco, with an average price of 495.00


In [23]:
# 4 taster that has reviewed the most wines

unique_tasters = {wine['taster_name'] for wine in wines}

tasters = {}

for taster in unique_tasters:
    tasters[taster] = 0

    for wine in wines:
        if wine['taster_name'] == taster and wine['taster_name'] != None:
            tasters[taster] += 1

most_active_taster = max(tasters, key=tasters.get)

print(f"The most active taster is {most_active_taster}, with {tasters[most_active_taster]} reviews")

The most active taster is Roger Voss, with 25514 reviews
