Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
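
One way to answer the "just by guessing" question is to score a naive baseline before fitting anything. The sketch below uses invented toy targets, not data from this notebook. For a regression target it always guesses the mean and reports mean absolute error; for a classification target it always guesses the most frequent class and reports accuracy. Which metric fits your project is your call.

```python
from collections import Counter
from statistics import mean

# Toy targets, invented for illustration only
y_regression = [100, 150, 200, 250, 300]
y_classification = ['low', 'low', 'high', 'low', 'medium']

# Regression baseline: always guess the mean, score with mean absolute error
guess = mean(y_regression)
baseline_mae = mean(abs(y - guess) for y in y_regression)

# Classification baseline: always guess the majority class, score with accuracy
majority_class, majority_count = Counter(y_classification).most_common(1)[0]
baseline_accuracy = majority_count / len(y_classification)

print(baseline_mae, majority_class, baseline_accuracy)
```

A first model is only worth keeping if it beats these numbers on held-out data.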

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep just the subset for the Tribeca neighborhood and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

# More practice

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Read New York City property sales data
import pandas as pd
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

Your code starts here:

In [3]:
# Change column names: replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_')

In [4]:
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [5]:
# Get Pandas Profiling Report
from pandas_profiling import ProfileReport
# Build the report first; to_notebook_iframe() only displays it and returns None
profile = ProfileReport(df, minimal=True)

profile.to_notebook_iframe()

In [6]:
# Keep just the subset of data for the Tribeca neighborhood
# Check how many rows you have now. (Should go down from > 20k rows to 146)
df.shape

(23040, 21)

In [7]:
df['NEIGHBORHOOD'].unique()

array(['CHELSEA', 'FASHION', 'GREENWICH VILLAGE-WEST',
       'UPPER EAST SIDE (59-79)', 'UPPER EAST SIDE (79-96)',
       'UPPER WEST SIDE (96-116)', 'MORRIS PARK/VAN NEST',
       'PELHAM PARKWAY SOUTH', 'SCHUYLERVILLE/PELHAM BAY', 'WESTCHESTER',
       'WILLIAMSBRIDGE', 'BAY RIDGE', 'BOROUGH PARK', 'CANARSIE',
       'CROWN HEIGHTS', 'FLATBUSH-NORTH', 'MADISON', 'MIDWOOD',
       'OCEAN PARKWAY-NORTH', 'WILLIAMSBURG-EAST', 'WILLIAMSBURG-SOUTH',
       'ASTORIA', 'BAYSIDE', 'ELMHURST', 'FLORAL PARK', 'FOREST HILLS',
       'MASPETH', 'MIDDLE VILLAGE', 'QUEENS VILLAGE', 'RIDGEWOOD',
       'SOUTH JAMAICA', 'WEST NEW BRIGHTON', 'MIDTOWN EAST',
       'UPPER WEST SIDE (59-79)', 'UPPER WEST SIDE (79-96)',
       'CASTLE HILL/UNIONPORT', 'KINGSBRIDGE HTS/UNIV HTS',
       'MORRISANIA/LONGWOOD', 'MOTT HAVEN/PORT MORRIS', 'SOUNDVIEW',
       'THROGS NECK', 'WAKEFIELD', 'BEDFORD STUYVESANT', 'BENSONHURST',
       'BERGEN BEACH', 'BUSH TERMINAL', 'BUSHWICK', 'CLINTON HILL',
       'DYKER HEIG

In [8]:
df = df[df['NEIGHBORHOOD'] == 'TRIBECA']
df.shape

(146, 21)

In [9]:
# Q. What's the date range of these property sales in Tribeca?
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])

In [10]:
df['SALE_DATE'].dt.date

220      2019-01-03
763      2019-01-07
996      2019-01-08
1276     2019-01-09
1542     2019-01-10
            ...    
22221    2019-04-24
22732    2019-04-29
22733    2019-04-29
22897    2019-04-30
22898    2019-04-30
Name: SALE_DATE, Length: 146, dtype: object
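
The date-range question can also be answered directly with `min()` and `max()` instead of reading the printed column. This is a minimal sketch on three toy dates chosen for illustration, not the full Tribeca subset.

```python
import pandas as pd

# Toy dates, invented for illustration (not the Tribeca data)
sale_dates = pd.to_datetime(pd.Series(['2019-01-03', '2019-04-30', '2019-02-12']))

# Earliest and latest sale, plus the number of days between them
first_sale = sale_dates.min()
last_sale = sale_dates.max()
span_days = (last_sale - first_sale).days

print(first_sale.date(), last_sale.date(), span_days)
```

On the real data the same two calls would be applied to `df['SALE_DATE']` after the `pd.to_datetime` conversion.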

In [11]:
df['SALE_PRICE']

220       $   2,800,000
763       $   2,650,000
996             $   - 0
1276      $   1,005,000
1542     $   12,950,000
              ...      
22221     $   5,761,259
22732     $   2,600,000
22733       $   605,000
22897       $   960,000
22898       $   975,000
Name: SALE_PRICE, Length: 146, dtype: object

In [12]:
# The Pandas Profiling Report showed that SALE_PRICE was read as strings
# Convert it to integers
df['SALE_PRICE'] = df['SALE_PRICE'].str.strip('$').str.strip().str.replace(',','').str.replace('- ','').astype(int)

In [13]:
df['SALE_PRICE']

220       2800000
763       2650000
996             0
1276      1005000
1542     12950000
           ...   
22221     5761259
22732     2600000
22733      605000
22897      960000
22898      975000
Name: SALE_PRICE, Length: 146, dtype: int64
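
An alternative to chaining several `strip` and `replace` calls is a single regex that drops every non-digit character. The sketch below runs on three toy strings in the same format as the column above. It assumes whole-dollar, non-negative prices: a decimal point or a real minus sign would be removed too, so it is not a general-purpose currency parser.

```python
import pandas as pd

# Toy strings in the same format as the SALE_PRICE column shown above
raw = pd.Series(['$   2,800,000', '$   - 0', '$   605,000'])

# Drop everything that is not a digit, then convert to integers
cleaned = raw.str.replace(r'\D', '', regex=True).astype(int)

print(cleaned.tolist())
```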

In [14]:
# Q. What is the maximum SALE_PRICE in this dataset?
df['SALE_PRICE'].max()

260000000

In [15]:
# Look at the row with the max SALE_PRICE
df[df['SALE_PRICE'] == df['SALE_PRICE'].max()]

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
6499,1,TRIBECA,08 RENTALS - ELEVATOR APARTMENTS,2,224,1,,D8,34 DESBROSSES STREET,,10013.0,283.0,3.0,286.0,36858,305542.0,2007.0,2,D8,260000000,2019-02-01


In [16]:
# Get value counts of TOTAL_UNITS
# Q. How many property sales were for multiple units?
df['TOTAL_UNITS'].value_counts()

1.0      131
0.0       11
5.0        1
286.0      1
8.0        1
3.0        1
Name: TOTAL_UNITS, dtype: int64

In [17]:
# Count the sales with more than one unit
(df['TOTAL_UNITS'] > 1).sum()

4

In [18]:
# Keep only the single units
df = df[df['TOTAL_UNITS'] == 1]

In [19]:
# Q. Now what is the max sales price? How many square feet does it have?
df[df['SALE_PRICE'] == df['SALE_PRICE'].max()]['GROSS_SQUARE_FEET']

9236    8346.0
Name: GROSS_SQUARE_FEET, dtype: float64

In [20]:
# Q. How often did $0 sales occur in this subset of the data?
df[df['SALE_PRICE'] == 0].count()
# There's a glossary here: 
# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

# It says:
# A $0 sale indicates that there was a transfer of ownership without a 
# cash consideration. There can be a number of reasons for a $0 sale including 
# transfers of ownership from parents to children. 



BOROUGH                           15
NEIGHBORHOOD                      15
BUILDING_CLASS_CATEGORY           15
TAX_CLASS_AT_PRESENT              15
BLOCK                             15
LOT                               15
EASE-MENT                          0
BUILDING_CLASS_AT_PRESENT         15
ADDRESS                           15
APARTMENT_NUMBER                  15
ZIP_CODE                          15
RESIDENTIAL_UNITS                 15
COMMERCIAL_UNITS                  15
TOTAL_UNITS                       15
LAND_SQUARE_FEET                  15
GROSS_SQUARE_FEET                 15
YEAR_BUILT                        15
TAX_CLASS_AT_TIME_OF_SALE         15
BUILDING_CLASS_AT_TIME_OF_SALE    15
SALE_PRICE                        15
SALE_DATE                         15
dtype: int64
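
Calling `count()` on the filtered frame repeats the same number for every column. If only the number of $0 sales matters, summing a boolean mask gives one value, and its mean gives the share. A small sketch with made-up prices:

```python
import pandas as pd

# Toy prices, invented for illustration
prices = pd.Series([2800000, 0, 1005000, 0, 975000])

# True counts as 1, so the sum is the number of $0 sales
zero_sales = (prices == 0).sum()
# The mean of the mask is the fraction of sales that were $0
zero_share = (prices == 0).mean()

print(zero_sales, zero_share)
```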

In [21]:
# Look at property sales for > 5,000 square feet
# Q. What is the highest square footage you see?
df[df['GROSS_SQUARE_FEET'] > 5000]['GROSS_SQUARE_FEET'].max()

39567.0

In [22]:
# What are the building class categories?
# How frequently does each occur?
df['BUILDING_CLASS_CATEGORY'].unique()

array(['13 CONDOS - ELEVATOR APARTMENTS',
       '15 CONDOS - 2-10 UNIT RESIDENTIAL',
       '16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT',
       '46 CONDO STORE BUILDINGS'], dtype=object)

In [23]:
df['BUILDING_CLASS_CATEGORY'].value_counts()

13 CONDOS - ELEVATOR APARTMENTS               121
15 CONDOS - 2-10 UNIT RESIDENTIAL               8
46 CONDO STORE BUILDINGS                        1
16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT      1
Name: BUILDING_CLASS_CATEGORY, dtype: int64

In [24]:
# Keep subset of rows:
# Sale price more than $0, 
# Building class category = Condos - Elevator Apartments
df = df[(df['SALE_PRICE'] > 0)& (df['BUILDING_CLASS_CATEGORY'] == '13 CONDOS - ELEVATOR APARTMENTS')]
# Check how many rows you have now. (Should be 106 rows.)
df.shape

(106, 21)

In [25]:
# Make a Plotly Express scatter plot of GROSS_SQUARE_FEET vs SALE_PRICE
import plotly.express as px

fig = px.scatter(x=df['GROSS_SQUARE_FEET'], y= df['SALE_PRICE'], trendline="ols")
fig.show()


pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.



In [26]:
# Add an OLS (Ordinary Least Squares) trendline,
# to see how the outliers influence the "line of best fit"
fig = px.scatter(x=df['GROSS_SQUARE_FEET'], y= df['SALE_PRICE'], trendline="ols")
fig.show();

In [27]:
# Look at sales for more than $35 million

# All are at 70 Vestry Street
# All but one have the same SALE_PRICE & SALE_DATE
# Was the SALE_PRICE for each? Or in total?
# Is this dirty data?


In [28]:
df[df['SALE_PRICE'] > 35000000]

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
8370,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1105,,R4,"70 VESTRY STREET, 3C",3C,10013.0,1.0,0.0,1.0,0,1670.0,2016.0,2,R4,36681561,2019-02-12
8371,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1123,,R4,"70 VESTRY STREET, 6C",6C,10013.0,1.0,0.0,1.0,0,1906.0,2016.0,2,R4,36681561,2019-02-12
8372,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1124,,R4,"70 VESTRY STREET, 6D",6D,10013.0,1.0,0.0,1.0,0,2536.0,2016.0,2,R4,36681561,2019-02-12
8373,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1125,,R4,"70 VESTRY STREET, 6E",6E,10013.0,1.0,0.0,1.0,0,2965.0,2016.0,2,R4,36681561,2019-02-12
8374,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1126,,R4,"70 VESTRY STREET, 6F",6F,10013.0,1.0,0.0,1.0,0,2445.0,2016.0,2,R4,36681561,2019-02-12
8375,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1127,,R4,"70 VESTRY STREET, 7A",7A,10013.0,1.0,0.0,1.0,0,2844.0,2016.0,2,R4,36681561,2019-02-12
8376,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1128,,R4,"70 VESTRY STREET, 7B",7B,10013.0,1.0,0.0,1.0,0,3242.0,2016.0,2,R4,36681561,2019-02-12
8377,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1129,,R4,"70 VESTRY STREET, 7C",7C,10013.0,1.0,0.0,1.0,0,1906.0,2016.0,2,R4,36681561,2019-02-12
8378,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1130,,R4,"70 VESTRY STREET, 7D",7D,10013.0,1.0,0.0,1.0,0,2536.0,2016.0,2,R4,36681561,2019-02-12
8379,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1131,,R4,"70 VESTRY STREET, 7E",7E,10013.0,1.0,0.0,1.0,0,2965.0,2016.0,2,R4,36681561,2019-02-12
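
To check the "same SALE_PRICE and SALE_DATE" suspicion without reading the table by eye, one option is to group by both columns and look for groups with more than one row. The sketch below uses a tiny toy frame: three rows copied from the pattern above plus one invented row (`1 EXAMPLE STREET, 2A` is hypothetical).

```python
import pandas as pd

# Toy frame: three rows mimic the repeated price above, the last row is invented
sales = pd.DataFrame({
    'ADDRESS': ['70 VESTRY STREET, 3C', '70 VESTRY STREET, 6C',
                '70 VESTRY STREET, 6D', '1 EXAMPLE STREET, 2A'],
    'SALE_PRICE': [36681561, 36681561, 36681561, 2600000],
    'SALE_DATE': ['2019-02-12', '2019-02-12', '2019-02-12', '2019-04-29'],
})

# Count rows per (price, date) pair and keep the pairs that repeat
group_sizes = sales.groupby(['SALE_PRICE', 'SALE_DATE']).size()
suspicious = group_sizes[group_sizes > 1]

print(suspicious)
```

A repeated pair does not prove the data is dirty, but it flags rows where the recorded price may be a bulk total rather than a per-unit price.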


In [29]:
# Make a judgment call:
# Keep rows where sale price was < $35 million

# Check how many rows you have now. (Should be down to 90 rows.)
df = df[df['SALE_PRICE'] < 35000000]
df.shape

(90, 21)

In [30]:
# Now that you've removed outliers,
# Look again at a scatter plot with OLS (Ordinary Least Squares) trendline
fig = px.scatter(x=df['GROSS_SQUARE_FEET'], y= df['SALE_PRICE'], trendline="ols")
fig.show();

In [31]:
# Select these columns, then write to a csv file named tribeca.csv. Don't include the index.
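
The cell above was left without code. A minimal sketch of the export step follows. The column list is an assumption, since the assignment's list of columns is not shown here, and the toy rows are only for illustration. The file goes to a temporary folder so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

# Toy rows; the column choice is an assumption, not the assignment's list
toy = pd.DataFrame({
    'GROSS_SQUARE_FEET': [1670.0, 1906.0],
    'YEAR_BUILT': [2016.0, 2016.0],
    'SALE_PRICE': [2800000, 2650000],
    'SALE_DATE': ['2019-01-03', '2019-01-07'],
})
columns = ['GROSS_SQUARE_FEET', 'YEAR_BUILT', 'SALE_PRICE', 'SALE_DATE']

# index=False keeps the row labels out of the file
path = os.path.join(tempfile.mkdtemp(), 'tribeca.csv')
toy[columns].to_csv(path, index=False)

# Read the file back to confirm only the selected columns were written
round_trip = pd.read_csv(path)
print(round_trip.columns.tolist())
```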


# My data

## Work from last assignment

In [1]:
import pandas as pd
import numpy as np

data1 = pd.read_csv('https://raw.githubusercontent.com/nastyalolpro/project_data/master/build_week_2/winemag-data_first150k.csv')
data2 = pd.read_csv('https://raw.githubusercontent.com/nastyalolpro/project_data/master/build_week_2/winemag-data-130k-v2.csv')
uci_red = pd.read_csv('https://raw.githubusercontent.com/nastyalolpro/project_data/master/build_week_2/winequality-red.csv', delimiter=";")
uci_white = pd.read_csv('https://raw.githubusercontent.com/nastyalolpro/project_data/master/build_week_2/winequality-white.csv', delimiter=";")

In [2]:
def wrangle_target(X):
  """Add a 'quality' column that bins 'points' into low / medium / high."""
  X = X.copy()

  quality = []
  for i in X['points']:
    if i < 87:
      quality.append('low')
    elif i < 91:
      quality.append('medium')
    else:
      quality.append('high')

  X['quality'] = quality
  return X

data1 = wrangle_target(data1)
data2 = wrangle_target(data2)
data = pd.concat([data1, data2], join = 'outer')

data = data.reset_index(drop=True)
data = data.drop(columns = 'Unnamed: 0')
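
The same binning can be written without an explicit loop using `pd.cut`. This is a sketch on toy scores, not a drop-in replacement: it assumes integer `points`, and unlike the loop above it leaves missing values as missing instead of labelling them `'high'`.

```python
import numpy as np
import pandas as pd

# Toy scores, invented for illustration
points = pd.Series([80, 86, 87, 90, 91, 100])

# Intervals are closed on the right: (-inf, 86], (86, 90], (90, inf)
quality = pd.cut(points,
                 bins=[-np.inf, 86, 90, np.inf],
                 labels=['low', 'medium', 'high'])

print(quality.tolist())
```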

In [4]:
uci_red['color'] = 'red'
uci_white['color'] = 'white'

uci_wine = pd.concat([uci_red, uci_white])

# Bin the numeric UCI quality score: below 6 is low, exactly 6 is medium, 7 and up is high
uci_quality = []
for i in uci_wine['quality']:
  if i < 6:
    uci_quality.append('low')
  elif i < 7:
    uci_quality.append('medium')
  else:
    uci_quality.append('high')

uci_wine['quality_c'] = uci_wine['quality']
uci_wine['quality'] = uci_quality
uci_wine = uci_wine.reset_index(drop=True)

## Wrangling

In [44]:
print(data.shape)
data.sample(2)

(280901, 17)


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,quality,taster_name,taster_twitter_handle,title,digits,splitted,years
177571,US,"At 81%, this could be a varietally labeled Mer...",BFM,91,53.0,Washington,Columbia Valley (WA),Columbia Valley,Bordeaux-style Red Blend,Dusted Valley,high,Sean P. Sullivan,@wawinereport,Dusted Valley 2013 BFM Red (Columbia Valley (WA)),"[81, 15]","[At, 81, this, could, be, a, varietally, label...",[]
6658,Germany,Initially quiet apple and lemon notes intensif...,Brauneberger Juffer Auslese,93,46.0,Mosel,,,Rieslaner,Fritz Haag,high,,,,[],"[Initially, quiet, apple, and, lemon, notes, i...",[]


In [4]:
from random import randint

# Let's look at a few random descriptions
for _ in range(10):
  print(data['description'][randint(0, len(data) - 1)], '\n')

Fairly earthy and a little bit herbal on the nose, with aromas of tomatillo, oregano and light fruit that almost amount to a nice salsa fresca. The body is a little big and flat, but there's also generous fruit, size and finish. Richer than many Merlots in this price class, but with peppery, herb notes throughout. 

A big step up from the first time we tried this wine in 2007. Aromas are pure and fruity, while the palate is ripe and acid-rich, with juicy but tight plum, herb and berry flavors. Shows chocolate on the tail, and overall it is on the money. A blend of five grapes led by Syrah, Cabernet Sauvignon and Carmenère. 

For those who like the acquired taste of red sparkling wine, this is an enjoyable example. It balances its acidity with taut black-currant flavors, along with a firm core of tannins. 

Fresh Granny Smith apple and lemon notes are announced on the nose. On the palate they become fully fledged and weave their refreshing, uncompromisingly dry way across a slender dry 

In [3]:
import re
 
# Collect every run of digits found in each description
data['digits'] = [re.findall(r"\d+", text) for text in data['description']]

In [70]:
# Replace everything except letters, digits and newlines with a space, then split into tokens
data['splitted'] = [re.sub(r'[^a-zA-Z0-9\n]', ' ', text).split() for text in data['description']]
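
To make the effect of the two regex steps concrete, here is a sketch on a single made-up sentence (it is not a row from the dataset). It shows which digit runs are captured and what the token list looks like once punctuation is replaced by spaces.

```python
import re

# Made-up description, for illustration only
text = "This spent 20 months in 30% new French oak, and drinks well through 2032."

# Every run of digits, as strings
digits = re.findall(r"\d+", text)

# Replace anything that is not a letter or digit with a space, then split
tokens = re.sub(r"[^a-zA-Z0-9]", " ", text).split()

print(digits)
print(tokens)
```

Note that the character class only covers ASCII letters, so accented words are broken apart at the accented character.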

In [64]:
#<generator object flatten at 0x7f86675aff10>
# from collections import Iterable
# def flatten(coll):
#     for i in coll:
#             if isinstance(i, Iterable) and not isinstance(i, basestring):
#                 for subc in flatten(i):
#                     yield subc
#             else:
#                 yield i
def flatten(A):
    """Recursively flatten nested lists into a single flat list."""
    rt = []
    for i in A:
        if isinstance(i, list):
            rt.extend(flatten(i))
        else:
            rt.append(i)
    return rt

In [74]:
# Split tokens that mix letters and digits (e.g. '2020s' -> ['2020', 's']),
# then flatten the nested lists. Build the cleaned lists first and assign the
# column once, rather than assigning to data['splitted'][row] row by row,
# which triggers pandas' SettingWithCopyWarning.
mixed_count = 0
cleaned = []
for tokens in data['splitted']:
  tokens = list(tokens)
  for position, token in enumerate(tokens):
    if re.search(r'\d', token):
      mixed_count += 1
      tokens[position] = re.findall(r"[^\W\d_]+|\d+", token)
  cleaned.append(flatten(tokens))
data['splitted'] = cleaned

# Number of tokens that contained at least one digit
print(mixed_count)
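
The regex in the cell above, `[^\W\d_]+|\d+`, matches either a run of letters or a run of digits. The sketch below applies it to a few toy tokens to show the pieces it produces. `split_mixed` is a hypothetical helper name introduced only for this example.

```python
import re

def split_mixed(token):
    # Hypothetical helper: separate runs of letters from runs of digits
    return re.findall(r"[^\W\d_]+|\d+", token)

# Toy tokens, invented for illustration
tokens = ['drink', '2020s', 'through', '2025']

# Splitting and flattening in one pass with a nested comprehension
flat = [piece for token in tokens for piece in split_mixed(token)]

print(flat)
```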

In [85]:
data['years'] = [list() for x in range(len(data.index))]

In [86]:
# Treat every number greater than 2000 as a candidate year
for row in range(len(data)):
  for digit in data['digits'][row]:
    if int(digit) > 2000:
      data['years'][row].append(digit)
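
The `> 2000` test also accepts large numbers that are not years, such as a case count of 20000. A stricter filter is sketched below on a made-up sentence: it keeps only four-digit numbers inside a plausible range. The 1900 to 2100 bounds are an assumption chosen for illustration.

```python
import re

# Made-up description, for illustration only
text = "A 1200-case blend aged 18 months; drink from 2020 through 2030."

candidates = re.findall(r"\d+", text)

# Keep four-digit numbers within an assumed plausible range of years
years = [int(n) for n in candidates if len(n) == 4 and 1900 <= int(n) <= 2100]

print(years)
```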

With this one I tried to strip unhelpful letters (such as "s" and "th") from tokens that contain digits:

In [13]:
# for row in range(len(data)):
#   for word in range(len(data['splitted'][row])):
#     if bool(re.search(r'\d', data['splitted'][row][word])):
#       data['splitted'][row][word] = data['splitted'][row][word].replace('s', '').replace('th', '')


This piece of code separates the letters from the digits in each token that contains a digit:

In [26]:
# from itertools import chain

# for row in range(len(data)):
#   for word in range(len(data['splitted'][row])):
#     if bool(re.search(r'\d', data['splitted'][row][word])):
#       letters = ''.join(re.findall('([a-zA-Z])', data['splitted'][row][word]))
#       numbers = ''.join(re.findall('([0-9])', data['splitted'][row][word]))
#       if data['splitted'][row][word].index(letters) < data['splitted'][row][word].index(numbers):
#         l = [letters, numbers]
#         data['splitted'][row][word] = l
#         data['splitted'][row] = list(chain.from_iterable(data['splitted'][row]))
#       else:
#         l = [numbers, letters]
#         data['splitted'][row][word] = l
#         data['splitted'][row] = list(chain.from_iterable(data['splitted'][row]))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


ValueError: ignored

In [87]:
data.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,quality,taster_name,taster_twitter_handle,title,digits,splitted,years
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,high,,,,"[100, 2022, 2030]","[This, tremendous, 100, varietal, wine, hails,...","[2022, 2030]"
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,high,,,,[2023],"[Ripe, aromas, of, fig, blackberry, and, cassi...",[2023]
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,high,,,,[122],"[Mac, Watson, honors, the, memory, of, a, wine...",[]
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,high,,,,"[20, 30, 2032]","[This, spent, 20, months, in, 30, new, French,...",[2032]
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,high,,,,"[1200, 18, 2020]","[This, is, the, top, wine, from, La, B, gude, ...",[2020]


In [88]:
data.to_csv('data2.csv')

In [83]:
# For every row in the dataset
for row in range(len(data)):
  # For every year found in that row (data['years'])
  for num in range(len(data['years'][row])):
    # Get the index of the token just before the year
    index = data['splitted'][row].index(data['years'][row][num]) - 1
    #index2 = data['splitted'][row].index(data['years'][row][num]) + 1
    #print(data['splitted'][row][index])
    # Append that preceding token to the same list of years
    data['years'][row].append(data['splitted'][row][index])
    print(data['years'][row])

['2022', '2030', 'Enjoy', '2022', '2030']


IndexError: ignored
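
Two properties of the loop above are worth keeping in mind: `list.index()` returns only the first occurrence of a value, and an index of `-1` silently wraps around to the last token when the year is the first token. The sketch below is one possible alternative on toy data. It walks the tokens with `enumerate` and stores the preceding words in a separate list, so the list of years is never modified while it is being read.

```python
# Toy tokens and years, invented for illustration
tokens = ['Enjoy', '2022', 'through', '2030']
years = ['2022', '2030']

previous_words = []
for position, token in enumerate(tokens):
    # Skip position 0: there is no preceding token to look at
    if token in years and position > 0:
        previous_words.append(tokens[position - 1])

print(previous_words)
```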