**Dealing with a large dataset on your computer** is a common challenge for data analysts. If you need to process or filter the data, you can set the `chunksize=` argument of pandas' `read_csv()` function and loop through the file in manageable chunks.

For example, if you wanted to pull just the rows where the `HCPS Code` is **99213** from a large data file (HCPS_data.csv), you could read the file in chunks of 100,000 rows, filter each chunk to the rows with that code, save each filtered result to a list, and concatenate them all at the end. The syntax would look something like this:

```
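import pandas as pd

code_99213_rows = []

# Read 100,000 rows at a time and keep only the matching rows from each chunk
for chunk in pd.read_csv('HCPS_data.csv', chunksize=100000):
    code_99213_rows.append(chunk[chunk['HCPS Code'] == '99213'])

# Stitch the filtered pieces into a single DataFrame
code_99213_df = pd.concat(code_99213_rows, ignore_index=True)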
```
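
One thing to watch: the filter compares against the string `'99213'`, which only matches when pandas has read the column as text. Because each chunk is parsed on its own, a chunk that happens to contain only numeric codes can be inferred as integers and silently match nothing. Here is a minimal sketch of one way to guard against that, assuming a tiny in-memory stand-in for HCPS_data.csv (the sample rows and the `Paid` column are made up for illustration):

```python
import io
import pandas as pd

# Made-up stand-in for HCPS_data.csv; only the 'HCPS Code' column name comes from the text.
sample_csv = io.StringIO(
    "HCPS Code,Paid\n"
    "99213,75.10\n"
    "99214,110.00\n"
    "99213,74.25\n"
    "G0463,52.00\n"
)

matches = []
# dtype={'HCPS Code': str} keeps the codes as text in every chunk,
# so the comparison with the string '99213' behaves the same throughout.
for chunk in pd.read_csv(sample_csv, chunksize=2, dtype={'HCPS Code': str}):
    matches.append(chunk[chunk['HCPS Code'] == '99213'])

code_99213_df = pd.concat(matches, ignore_index=True)
print(code_99213_df)
```

If the codes in your file are purely numeric, comparing against the integer `99213` is another option; the point is that the column's type and the value you compare against should agree.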
======================================================================   

To shrink a file so that it loads more quickly, it can make sense to convert a text file (CSV) to a binary format. In Python, you can first reduce the data's memory footprint and then store the resulting object (a DataFrame) as a [pickle](https://docs.python.org/3/library/pickle.html) file.

In [8]:
import pandas as pd
import pickle

In [9]:
%%time
june_trip = pd.read_csv('../data/june_trip.csv')
june_trip.head()

Wall time: 4.57 s


| | pubTimeStamp | companyName | tripRecordNum | sumdID | tripDuration | tripDistance | startDate | startTime | endDate | endTime | startLatitude | startLongitude | endLatitude | endLongitude | tripRoute | create_dt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-06-01 00:05:46.817000 | Bird | BRD3572 | PoweredSPI1T | 4.0 | 328.084 | 2019-06-01 00:00:00 | 00:02:18.203333 | 2019-06-01 00:00:00 | 00:06:16.406666 | 36.1644 | -86.7807 | 36.1636 | -86.7802 | `[[36.164679,-86.781089],[36.163693,-86.78011],...` | 2019-06-02 05:30:19.960000 |
| 1 | 2019-06-01 00:05:46.817000 | Bird | BRD3571 | Powered2I3MS | 5.0 | 4921.26 | 2019-06-01 00:00:00 | 00:02:44.803333 | 2019-06-01 00:00:00 | 00:07:28.286666 | 36.1753 | -86.7943 | 36.1753 | -86.7943 | `[[36.175367,-86.794232],[36.175367,-86.794232]...` | 2019-06-02 05:30:19.927000 |
| 2 | 2019-06-01 00:09:54 | Gotcha | GOT1 | Powered327 | 12.0 | 12.426575 | 2019-06-01 00:00:00 | 00:09:56 | 2019-06-01 00:00:00 | 00:21:56 | 36.161501 | -86.77601 | 36.152529 | -86.783742 | `[["36.16149","-86.77605000000001"]]` | 2019-06-06 22:23:08.673000 |
| 3 | 2019-06-01 00:10:46.957000 | Bird | BRD3610 | Powered8U1A6 | 2.0 | 0.0 | 2019-06-01 00:00:00 | 00:10:31.163333 | 2019-06-01 00:00:00 | 00:12:02.773333 | 36.164 | -86.7807 | 36.1631 | -86.7797 | `[[36.163168,-86.779639]]` | 2019-06-02 05:30:20.283000 |
| 4 | 2019-06-01 00:10:46.957000 | Bird | BRD3612 | PoweredSPI1T | 5.0 | 656.168 | 2019-06-01 00:00:00 | 00:07:21.430000 | 2019-06-01 00:00:00 | 00:12:30.913333 | 36.165 | -86.7799 | 36.1659 | -86.7778 | `[[36.164951,-86.779836],[36.16494,-86.779456],...` | 2019-06-02 05:30:20.347000 |


### Now try to reduce the size of the file
- columns with the `object` dtype (usually text) require the most space


In [10]:
june_trip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205627 entries, 0 to 205626
Data columns (total 16 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   pubTimeStamp    205627 non-null  object 
 1   companyName     205627 non-null  object 
 2   tripRecordNum   205627 non-null  object 
 3   sumdID          205627 non-null  object 
 4   tripDuration    205627 non-null  float64
 5   tripDistance    205627 non-null  float64
 6   startDate       205627 non-null  object 
 7   startTime       205627 non-null  object 
 8   endDate         205627 non-null  object 
 9   endTime         205627 non-null  object 
 10  startLatitude   205627 non-null  float64
 11  startLongitude  205627 non-null  float64
 12  endLatitude     205627 non-null  float64
 13  endLongitude    205627 non-null  float64
 14  tripRoute       205627 non-null  object 
 15  create_dt       205627 non-null  object 
dtypes: float64(6), object(10)
memory usage: 25.1+ MB
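
The `+` in `25.1+ MB` means pandas has not measured the contents of the `object` columns, so the real footprint is larger. To see where the memory actually goes, `memory_usage(deep=True)` reports a per-column figure. The sketch below uses a small made-up frame rather than the real `june_trip` data; only the column names are borrowed from the listing above.

```python
import pandas as pd

# Small made-up frame standing in for june_trip; column names follow the .info() output.
df = pd.DataFrame({
    "companyName": ["Bird", "Lime", "Bird", "Lyft"] * 1000,
    "tripDuration": [4.0, 5.0, 12.0, 2.0] * 1000,
})

shallow = df.memory_usage()          # for object columns this skips the string contents
deep = df.memory_usage(deep=True)    # also measures the stored strings themselves

# Largest columns first: these are the best candidates for conversion
print(deep.sort_values(ascending=False))
```

On the full frame, `june_trip.info(memory_usage="deep")` gives the same deep measurement as a single total.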


#### convert the company name to an integer 
- find the unique company names
- assign each company an integer (you can use a dictionary for this step)
- update the `companyName` column to store the integer id for each company

In [13]:
june_trip.companyName.unique()

array(['Bird', 'Gotcha', 'SPIN', 'Bolt Mobility', 'Lime', 'Lyft', 'JUMP'],
      dtype=object)

In [14]:
# keys must match the names returned by .unique() exactly, including case
company_dict = {'Bird': 0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'SPIN': 4, 'JUMP': 5, 'Bolt Mobility': 6}

In [16]:
# map() looks up each company name in the dictionary and returns its integer id
june_trip.companyName = june_trip.companyName.map(company_dict)
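
Typing the dictionary by hand makes it easy for a key to differ from the name in the data. As an alternative sketch, the mapping can be built from the column itself. The frame below is made up; the company names are the ones returned by `.unique()`, and `trips` and `company_ids` are illustrative names.

```python
import pandas as pd

# Made-up sample; the company names are the ones returned by .unique().
trips = pd.DataFrame({
    "companyName": ["Bird", "Gotcha", "SPIN", "Bolt Mobility", "Lime", "Lyft", "JUMP", "Bird"]
})

# Build the lookup from the data itself so every key matches exactly.
company_ids = {name: i for i, name in enumerate(trips["companyName"].unique())}

# map() returns NaN for any name missing from the dictionary, so check before casting.
codes = trips["companyName"].map(company_ids)
assert codes.notna().all()

trips["companyName"] = codes.astype("int8")   # seven companies fit easily in int8
print(company_ids)
```

Keep `company_ids` around (or save it alongside the data) so the integers can be translated back to names later.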

#### next convert `pubTimeStamp` to a datetime 

In [None]:
june_trip['pubTimeStamp'] = pd.to_datetime(june_trip['pubTimeStamp'])
june_trip.head(2)
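
To see why this conversion helps, here is a rough sketch comparing the memory used by timestamps stored as text with the same values stored as datetimes. The two sample timestamps are copied from the output above and repeated to make the difference visible; `stamps` is an illustrative name.

```python
import pandas as pd

# Timestamps in the same layout as pubTimeStamp, repeated to get a measurable size.
stamps = pd.Series(["2019-06-01 00:05:46.817000", "2019-06-01 00:10:46.957000"] * 1000)

text_bytes = stamps.memory_usage(index=False, deep=True)

as_datetime = pd.to_datetime(stamps)
datetime_bytes = as_datetime.memory_usage(index=False, deep=True)

print(text_bytes, datetime_bytes)     # each datetime value takes 8 bytes
print(as_datetime.dt.hour.head())     # the .dt accessor is now available
```

Some rows in the real data have no fractional seconds. Depending on your pandas version, `to_datetime` may need a `format=` hint to parse such mixed layouts, so check the documentation for the version you have installed.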

#### Next remove unneeded data
#### keep just the scooters

In [None]:
# assumes the data has a `sumdgroup` column (it is not in the .info() listing above)
june_trip.sumdgroup.unique()

In [None]:
june_trip_scooters = june_trip.loc[june_trip.sumdgroup.isin(['scooter', 'Scooter'])]

#### keep just the columns we want to work with

In [None]:
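# `latitude`, `longitude`, and `chargelevel` are assumed to exist; adjust the list to the columns in your data
june_trip_scooters = june_trip_scooters[['pubTimeStamp', 'latitude', 'longitude', 'sumdID', 'chargelevel', 'companyName']]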
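
If you already know which columns you need, a different approach from the one above is to drop the rest while reading the file, so they never occupy memory at all. This is a sketch under stated assumptions: the two-row CSV is a made-up stand-in for june_trip.csv with only a few of its columns, and the `category` dtype is swapped in here as an alternative to the integer mapping used earlier.

```python
import io
import pandas as pd

# Made-up two-row stand-in for june_trip.csv with a subset of its columns.
sample_csv = io.StringIO(
    "pubTimeStamp,companyName,sumdID,tripDuration,tripRoute\n"
    "2019-06-01 00:05:46.817000,Bird,PoweredSPI1T,4.0,\"[[36.16,-86.78]]\"\n"
    "2019-06-01 00:10:46.957000,Bird,Powered8U1A6,2.0,\"[[36.16,-86.77]]\"\n"
)

slim = pd.read_csv(
    sample_csv,
    usecols=["pubTimeStamp", "companyName", "sumdID", "tripDuration"],  # tripRoute is never loaded
    dtype={"companyName": "category"},  # each label stored once, plus small integer codes
    parse_dates=["pubTimeStamp"],       # converted to datetime while loading
)
print(slim.dtypes)
```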

#### check `.info()` again

In [None]:
june_trip_scooters.info()

#### The only object datatype remaining is `sumdID` (an alphanumeric unique identifier)
- time to pickle

In [None]:
june_trip_scooters.to_pickle("../data/june_trip.pkl")

In [None]:
%%time
june_trip_test = pd.read_pickle("../data/june_trip.pkl")
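
A quick way to confirm the pickle preserved your work is to compare the restored frame with the original. The sketch below uses a small made-up frame and a temporary directory instead of the real `../data/` path; `example.pkl` is a hypothetical file name.

```python
import os
import tempfile
import pandas as pd

# Made-up frame; in the walkthrough this would be june_trip_scooters.
df = pd.DataFrame({
    "pubTimeStamp": pd.to_datetime(["2019-06-01 00:05:46", "2019-06-01 00:10:46"]),
    "companyName": pd.Series([0, 0], dtype="int8"),
    "sumdID": ["PoweredSPI1T", "Powered8U1A6"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")   # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)
    size_bytes = os.path.getsize(path)

# Unlike a CSV round trip, the dtypes come back exactly as they were saved.
print(restored.dtypes)
print(restored.equals(df), size_bytes)
```

Because unpickling can execute arbitrary code, only load pickle files that you created yourself or that come from a source you trust.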