**Dealing with a large dataset on your computer** is a common challenge for data analysts. If you only need to process or filter the data, you can pass the `chunksize=` argument to pandas' `read_csv()` function and loop through the file in manageable chunks.

For example, suppose you want to pull just the rows of a large data file (HCPS_data.csv) where the `HCPS Code` is **99213**. You could read the file in chunks of 100,000 rows, filter each chunk down to the rows with that code, save each filtered result to a list, and concatenate the list at the end. The syntax would look something like this:

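```
import pandas as pd

code_99213_rows = []

# Read the code column as text so the comparison with '99213' works in every chunk
for chunk in pd.read_csv('HCPS_data.csv', chunksize=100000, dtype={'HCPS Code': str}):
    # Keep only the rows in this chunk that have the code of interest
    code_99213_rows.append(chunk[chunk['HCPS Code'] == '99213'])

# Stitch the filtered chunks together into one dataframe
code_99213_df = pd.concat(code_99213_rows, ignore_index=True)
```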
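
Filtering is only one use of chunking. As a minimal sketch of the "process" case, the loop below keeps a running count of each code, so no chunk has to stay in memory after it has been counted. The tiny in-memory CSV, its `Provider` column, and its values are made up for illustration and stand in for a large file such as `HCPS_data.csv`.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for a large file: a tiny CSV held in memory
csv_text = "HCPS Code,Provider\n99213,A\n99214,B\n99213,C\nG0101,D\n99213,E\n"

code_counts = Counter()

# chunksize=2 is only for the demo; a real file would use a much larger value
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"HCPS Code": str}):
    # Add this chunk's counts to the running totals, then let the chunk go
    code_counts.update(chunk["HCPS Code"].value_counts().to_dict())

print(code_counts.most_common(1))
```

Only the small `Counter` grows as the loop runs, which is what keeps this pattern workable on files that do not fit in memory.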
======================================================================   

To shrink a file so that it loads more quickly, it can make sense to convert a text file (CSV) to a binary format. In Python, you can first reduce the data's memory footprint and then store the resulting object (a dataframe) as a [pickle](https://docs.python.org/3/library/pickle.html) file.
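
One reason a pickle can load faster is that it stores the dataframe as it is, dtypes included, so nothing has to be re-parsed from text. The sketch below is a small round trip under assumed conditions: the two-row dataframe and the `demo.csv` / `demo.pkl` file names are hypothetical, and the files are written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature dataframe with one datetime column
df = pd.DataFrame({
    "pubdatetime": pd.to_datetime(["2019-05-01 00:01:41", "2019-05-01 00:06:41"]),
    "chargelevel": [35.0, 40.0],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pkl")
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    # Read both files back while the temporary directory still exists
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# The CSV round trip returns the timestamps as text; the pickle keeps the datetime dtype
print(from_csv.dtypes["pubdatetime"])
print(from_pkl.dtypes["pubdatetime"])
```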

In [None]:
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import numpy as np

In [None]:
%%time
may = pd.read_csv('../data/may.csv')
may.head()

### Now try to reduce the size of the dataframe
- columns with the `object` dtype (usually text) require the most memory


In [None]:
may.info()
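
To see which columns are the heavy ones, `memory_usage(deep=True)` reports bytes per column and counts the Python strings inside `object` columns, which the default summary does not. This is a minimal sketch on made-up data; the `demo` dataframe only mimics the shape of the scooter data.

```python
import pandas as pd

# Hypothetical data: one text column stored as object, one numeric column
demo = pd.DataFrame({
    "companyname": pd.Series(["Bird", "Lime", "Bird", "Lyft"] * 1000, dtype=object),
    "chargelevel": [35.0, 80.0, 12.0, 99.0] * 1000,
})

# deep=True measures the string objects themselves, not just the pointers to them
per_column = demo.memory_usage(deep=True, index=False)
print(per_column)
```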

#### convert the company name to an integer 
- find the unique company names
- assign each company an integer (you can use a dictionary for this step)
- update the `companyname` column to store the integer id for each company

In [None]:
may.companyname.unique()

In [None]:
company_dict = {'Bird':0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}

In [None]:
# map each name to its integer id (a name missing from company_dict would become NaN)
may.companyname = may.companyname.map(company_dict)
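
The dictionary above is typed out by hand. As a sketch of the same three steps done programmatically, the mapping can be built from the unique values themselves. The `demo` dataframe is made up for illustration, and the ids follow order of first appearance, so they will not necessarily match the hand-written dictionary.

```python
import pandas as pd

# Hypothetical miniature of the companyname column
demo = pd.DataFrame({"companyname": ["Bird", "Lyft", "Bird", "Lime", "Lyft"]})

# Steps 1 and 2: number the unique company names in order of first appearance
company_dict = {name: i for i, name in enumerate(demo["companyname"].unique())}

# Step 3: store the integer id in place of the name
demo["companyname"] = demo["companyname"].map(company_dict)

print(company_dict)
print(demo["companyname"].tolist())
```

Keeping `company_dict` around matters, since it is the only way to translate the ids back into names later.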

#### next convert `pubdatetime` to a datetime 

In [None]:
may.pubdatetime = pd.to_datetime(may.pubdatetime)
may.head(2)
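
The conversion pays off in two ways: each timestamp takes a fixed 8 bytes instead of a full Python string, and the `.dt` accessor becomes available. A small sketch, assuming timestamps formatted like the ones in this dataset (the two values are copied from the sample rows in this notebook):

```python
import pandas as pd

# Timestamps as text, the way read_csv delivers them by default
stamps = pd.Series(["2019-05-01 00:01:41.247", "2019-05-01 00:06:41.537"], dtype=object)
parsed = pd.to_datetime(stamps)

# Datetime columns expose parts of the timestamp directly
print(parsed.dt.minute.tolist())

# Compare bytes used by the text version and the datetime version
text_bytes = stamps.memory_usage(deep=True, index=False)
datetime_bytes = parsed.memory_usage(index=False)
print(text_bytes, datetime_bytes)
```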

#### Next remove unneeded data
#### keep just the scooters

In [None]:
may.sumdgroup.unique()

In [None]:
may_scooters = may.loc[may.sumdgroup.isin(['scooter', 'Scooter'])]
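
Listing both spellings works when `unique()` shows exactly those two. As an alternative sketch, lowercasing the column first catches any capitalization without listing each variant. The `demo` values below are made up, and note that this version also keeps spellings such as `SCOOTER` that the `isin` list above would drop.

```python
import pandas as pd

# Hypothetical values with inconsistent capitalization
demo = pd.DataFrame({
    "sumdgroup": ["scooter", "Scooter", "bicycle", "SCOOTER"],
    "sumdid": ["a", "b", "c", "d"],
})

# Normalize case, then compare against a single lowercase label
is_scooter = demo["sumdgroup"].str.lower().eq("scooter")
demo_scooters = demo.loc[is_scooter]

print(demo_scooters["sumdid"].tolist())
```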

#### keep just the columns we want to work with

In [None]:
may_scooters = may_scooters[['pubdatetime', 'latitude', 'longitude', 'sumdid', 'chargelevel', 'companyname']]

#### check `.info()` again

In [None]:
may_scooters.info()

#### The only object datatype remaining is sumdid (an alphanumeric unique identifier)
- time to pickle

In [None]:
may_scooters.to_pickle("../data/may.pkl")

In [None]:
%%time
may_test = pd.read_pickle("../data/may.pkl")

In [None]:
may_scooters.head(2)

In [None]:
may_scooters.companyname.value_counts()

In [None]:
may_bird = may_scooters[may_scooters.companyname.eq(0)]

In [None]:
may_bird.head(2)

In [None]:
may_bird.companyname.unique()

In [None]:
may_bird.shape

In [None]:
may_bird_XWRWC = may_bird[may_bird.sumdid.eq('PoweredXWRWC')]

In [None]:
may_bird_XWRWC.shape

In [64]:
may_bird_XWRWC.head()

| | pubdatetime | latitude | longitude | sumdid | chargelevel | companyname |
|---|---|---|---|---|---|---|
| 1 | 2019-05-01 00:01:41.247 | 36.191252 | -86.772945 | PoweredXWRWC | 35.0 | 0 |
| 2539 | 2019-05-01 00:06:41.537 | 36.191252 | -86.772945 | PoweredXWRWC | 35.0 | 0 |
| 4719 | 2019-05-01 00:11:41.777 | 36.191318 | -86.772926 | PoweredXWRWC | 35.0 | 0 |
| 6694 | 2019-05-01 00:16:42.133 | 36.191318 | -86.772926 | PoweredXWRWC | 35.0 | 0 |
| 9146 | 2019-05-01 00:21:42.137 | 36.191284 | -86.772943 | PoweredXWRWC | 35.0 | 0 |


In [None]:
plt.scatter(x=may_bird_XWRWC.longitude,
           y=may_bird_XWRWC.latitude)

In [None]:
may_bird.info()

In [None]:
may_scooters.isnull().sum()

In [63]:
may_scooters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20283582 entries, 0 to 20292502
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.6+ GB
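
The numeric columns in this summary are all 64-bit, which is more room than ids from 0 to 6 or a charge level need. One possible further step, sketched here on made-up data rather than on `may_scooters`, is to downcast them. Latitude and longitude are left out on purpose, since `float32` keeps only about seven significant digits.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with the same dtypes as in the summary above
demo = pd.DataFrame({
    "chargelevel": np.array([35.0, 80.0, 12.5, 99.0] * 1000),
    "companyname": np.array([0, 3, 6, 1] * 1000, dtype="int64"),
})
before = demo.memory_usage(index=False).sum()

# Let pandas pick the smallest integer type that holds the ids
demo["companyname"] = pd.to_numeric(demo["companyname"], downcast="integer")
# Assumption: single precision is enough for a charge level
demo["chargelevel"] = demo["chargelevel"].astype("float32")

after = demo.memory_usage(index=False).sum()
print(before, after)
```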


In [62]:
help(plt.bar)

Help on function bar in module matplotlib.pyplot:

bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
    Make a bar plot.
    
    The bars are positioned at *x* with the given *align*\ment. Their
    dimensions are given by *height* and *width*. The vertical baseline
    is *bottom* (default 0).
    
    Many parameters can take either a single value applying to all bars
    or a sequence of values, one for each bar.
    
    Parameters
    ----------
    x : float or array-like
        The x coordinates of the bars. See also *align* for the
        alignment of the bars to the coordinates.
    
    height : float or array-like
        The height(s) of the bars.
    
    width : float or array-like, default: 0.8
        The width(s) of the bars.
    
    bottom : float or array-like, default: 0
        The y coordinate(s) of the bars bases.
    
    align : {'center', 'edge'}, default: 'center'
        Alignment of the bars to the *x* coordinates:
    
   