The Luck of the Procrastinator, or How I Spent My Fall Vacation.
Like most; I’m inundated with email, the day before Chipy's Fall 2018 Mentorship Program application deadline, a subject line caught my eye. Last Chance! I thought, if this is my last chance, how on earth did I miss my first? I clicked to learn all about it, and without hesitation, I thought "Why not?" I had no great expectations I'd be selected. I find great expectations often lead to severe disappointment, it's the mantra which supports my flight-by-seat-of-pants lifestyle. When I received the acceptance email, I was pleasantly surprised, until it sunk in seconds later...there go my lazy fall afternoons playing Skyrim. But it's exactly what I need, structure—I need method to my madness.
I've been working with data for the last 15 years or so, and currently use SAP/BW, SQL Server, Informatica and Tableau at my job. My CS curriculum was in Java, but I haven't used it much since graduating. I've been dipping my toes in Python and R for data cleansing, and planned on exploring Python for automation, pipelining and testing for the ETL we do at my company. Chipy’s program came to my attention at an opportune time, since I struggle to make headway without an end goal during my free time. We use enterprise tools at my job, so I explore open source on my own time. I've never had a mentor, unless you count my keyboard. It’s not very responsive, and makes no-sense when I bang my head against it. Passing along what I've learned has always brought me great satisfaction, so I figured it is my duty to pay it forward to some lucky mentor.
So here goes nothing, or Dear Diary (Project Management Edition)
Kickoff Dinner 8/27/18
Chipy hosted a kick-off dinner for the start of the program. Again, I didn't know what to expect. There was no order to the mix-and-mingle; we sat, ate and great conversation ensued. Ever so often, the program managers would switch us around, and we'd continue. Midway, we stopped to receive a rundown of the program, after which we could stay and get to know our fellow cohort some more. I got a chance to meet my mentor. I briefly mentioned my conundrum, should I do a fun project, or should I do a practical project? The pragmatic in me won, and I decided to solve a data problem, which has been bugging me at work for quite a while. I was familiar with the domain, and I knew I could work on it incrementally. It was a project with a doable minimum viable product I could do within the timeline. I'm also trying to learn R and then there are the video games...sssshhhh, our secret
Week 1 9/3/18 Monday, Sept 3 Mentor/Mentee 1-on-1
Agenda: Discuss New Product Development & Vitality Index Project
Discussion: I met Jim J. at the agreed upon bat time and channel to go over "The Project." Ah, it seemed so simple at the time. In my experience, once you start coding logic, the wrenches start flying, but I have honed the skill of dodging wrenches. I felt we were in a good spot, one of the benefits of picking a practical project is less time is wasted agonizing over what data set to use or what questions to ask and answer. I like to pace myself; for a procrastinator, I am extremely disciplined under pressure. One of my favorite thoughts on learning is by yo-yo ma, something along the lines of learning incrementally and consistently, makes learning fun and painless. Um, does this apply to coding?
Phase 1 - Data Pipeline
Minimum Viable Product:
- De-dup a dataset of materials based on a set of criteria for N rolling years.
- Include records where first instance Previous Status = New & New Status = Active
- Exclude records where Previous Status = Null
- Exclude records where last instance New Status = Inactive | Discontinued
Create Python Script: Arguments - Date (N Years) Materials.csv, Purchases.csv, Exclusions.csv
Calculate Vitality Index
Blog Medium | Notebook on Github Can you guess which one won?
Phase 2 Data Validation
Phase 3 Visualization
Phase 4 UI
Python script | Executable | Web Service/Web part on SharePoint
Phase 5 Data Integration
Get familiar with Anaconda Nav, Notebook & Pandas
Set-up Anaconda, Python 3 Environment, Create Project Notebook & Explore dataset using Pandas
Week 2 9/10/18 Monday, Sept 10 Mentor/Mentee 1-on-1
Agenda: Discuss Week 1 Next Steps and Deliverables
Discussion: We trouble-shooted publishing the project notebook on Anaconda cloud. I love Anaconda, it's the next best thing since sliced bread! It lets folks dive right into coding without the barrier and headache of setting up your environment and getting your packages to work. We coded the first condition in Pandas, "Include records where first instance Previous Status = New" and called it a day.
Next Steps: Start working on Blog 1
Deliverables: Share notebook on Anaconda Cloud & continue coding criteria.
Thursday, Sept 13 ChiPy (optional)
Deliverables: Head-shot Due
Week 3 9/17/18 Monday, Sept 17 Mentor/Mentee 1-on-1
Agenda: Discuss Week 2 Next Steps and Deliverables
Discussion: While exploring the data I realized I was overthinking some of the criteria. Since the first step was to identify materials, which had become active at some point, I didn't need to worry about the previous status. The issue with identifying NPD materials is that the creation date in SAP can happen much earlier to the time when a material becomes sellable, i.e. is in active status. To further complicate things, the material status can flip-flop throughout its lifetime. Materials which have been discontinued or were inactive, can be resurrected—these I refer to as zombie parts. We have a BW table that is written to any time the status changes, materials are hence, almost duplicates (except for the change date), which means they are slightly unique.
We later found the other condition, exclude materials whose current status in inactive or discontinued, contained materials which were never activated in the first place, go figure. We also reviewed reading the date fields in the csv as dates; we would need them to be dates once we coded the N Rolling years part, and also to de-dup based on the earliest date, and exclude based on the latest date.
Next Steps: Explore the to_datetime method, continue working on Blog 1
Deliverables: Work on reading the date fields as dates
Talking is how people grow to understand each other ~ Dadisms
Thursday, Sept 20 Data Science Coding Workshop - Blog 1 Due
Discussion: Today was the first of three coding workshops for the mentorship program. I learned the general programming workshop originally scheduled for today was swapped for the Data Science workshop. I was really looking forward to it, as data science is the track I am most interested in. I arrived at Braintree around 15 minutes late, d’oh! (I work in Des Plaines, and have to get home and hop on the brown line). Again, I didn't know what to expect and missed what I believe was a brief explanation of this week's project. The project was a bit more webdevy than data science. I had expected a project along the lines of exploring a dataset, asking/answering questions, and perhaps an intro into pandas or visualization. No bother; it was an opportunity to learn none the less, as they broke us up into working groups. We worked off of one computer, the assignment required setting up an environment, and folks had a blend of Windows, Mac and Linux. I use Anaconda exclusively for now, at least on my work laptop. I got a chance to learn about kwargs, which oddly enough sounded like Klingon to me, majQa’!
The first blog post was also due today by midnight, I had hoped to use GitHub and wanted to make sure I could. I didn't want to use a blogging site, simply because I underutilize GitHub and wanted to get more experience. I later learned a Slack channel was created specifically for this cohort’s blog posts. Now I know, and knowing is half the battle. Anytime I embark on an ogre of a tech project, I find it consists of layers of skills to learn. When you peel one off, I find there is another skill underneath I need to learn to move forward. One of these simple, but necessary skills was markdown. Markdown is a lightweight markup language with plain text formatting syntax. I also learned of Github.io (previously GitHub pages) which can be formatted with Jekyll themes; which I applied. I found they didn't always render my markdown accordingly, though. Jekyll is a simple, blog-aware, static site generator for personal, project, or organization sites. Written in Ruby by Tom Preston-Werner, GitHub's co-founder.
Deliverables: Blog Post 1
Week 4 9/24/18 Saturday & Sunday September 22nd - 23rd Hacking for Justice Workshop
Discussion: This past weekend I attended the Cook County States Attorney's Office workshop, Hacking for Justice. It was a two day event, Saturday (8am - 8pm) and Sunday (8am - 5pm) focused on exploring SAO's public case-level data. It introduced critical R skills like data cleaning, graphing, and descriptive statistics. As such, I did not meet with my mentor this week, I normally work on my project on the weekend. The cool thing about programming is finding a blend of best tools for the job, no matter the language. Dplyr is pretty sweet, and so is the Tidyverse.
Week 5 10/1/18 Monday, October 1st Mentor/Mentee 1-on-1
Agenda: Discuss Week 3 Next Steps and Deliverables
The rest of the time, I spent playing with Pandas. I can honestly say Pandas is the bee’s knees! Once I figured out my functions conceptually, I sketched them out in my handy, dandy notebook and serial killer chicken scratch. This is the second version. You don't want to see the first.
Jim and I discussed data validation, in particular handling null values. He also suggested using list comprehensions to replace spaces with underscores in headers after reading a csv file.
A list comprehension is a syntactic construct available in some programming languages for creating a list based on existing lists. It follows the form of the mathematical set-builder notation (set comprehension) as distinct from the use of map and filter functions. ~Wikipedia
Next Steps: Read up on list comprehensions.
Deliverables: Start writing functions
Week 6 10/8/18
Discussion: Unfortunately, I am no longer off on Mondays, so I couldn't meet with my mentor this morning. We later settled on Monday evenings instead, and agreed to meet next Monday after work. In the meantime, I've been working on Blog 2. I'm now getting the hang of markdown and adding some extra stuffs--nothing too fancy. My pictures and emoji’s don't work in Jekyll though. I've pretty much finished the core functions I outlined, so there's that. Today I started to tackle another project at work, pdfs. I came across some Python packages, but they depend on Poppler, which runs best on Linux. Maybe when this mentorship concludes, I can dig into it some more. I'm beginning to feel trapped inside a Max Brooks book.
Thursday, Oct 11 ChiPy (optional)
Week 7 10/15/18 Monday, October 15th Mentor/Mentee 1-on-1
Agenda: Discuss Week 5 Next Steps and Deliverables
Discussion: What a difference 6 weeks make. I'm born and raised in Chicago, and each year the first sign of "Winter is Coming" still pisses me off--all broody, Jon Snow like. Why can't I live in San Diego! Then I put my big girl pants on and remember what a great time of year this is! Halloween is just around the corner and with it comes a plethora of creepiness. Oh, the horror! On with it.
Jim and I walked through my functions. The neat thing about modularizing code is it makes it easier to find logic flaws, plus add unit testing (note to self). I first started piecing together code in notebook; later I found my inactive/discontinued logic was out of order. I had to first create the data frame and group by last occurrence, than I could filter on only those records who's last occurrence was either inactive or discontinued. Jim also mentioned adding some log files. Something else I’d been meaning to do was add some validation to check for number of records and columns and such. The last thing I had to implement for the core program was the NPD function and the VI function. Basically, the NPD function takes in all the data frames (active, inactive/disc, exclusions) and returns a clean list of NPD materials I can then use to calc my vitality index. The VI function joins the NPD data frame and the purchases data frame; we do some simple math, and viola, out comes the VI. Simple, huh? Well that’s not the fun part, the fun part is the visualization, for next time. I also mentioned my PDF project. Jim suggested I explore Docker, since I would need to use Linux in order to use the poppler libraries. For sure; but it’s command line, so for now I’ll use virtual machine. I will be sampling this onion layer later. Mmmmmm, awesome, blossom...
Next Steps: Start researching python viz packages
Deliverables: Finish blog 2 and write the NPD and VI functions
The light at the end of the tunnel
Thursday, Oct 18 Webdev Coding Workshop- Blog 2 Due
Discussion: The second coding workshop focused on web development. It turns out the first workshop was indeed the general Python 101 centered event, rather than the data science one. These workshops build upon themselves, so it makes sense to order them accordingly. Either way, this week’s workshop was focusing on Flask. Flask is a web development microframework for Python. I arrived at Braintree dark and early this time, boo winter, woo hoo being on time! Again, we were broken up into teams, with each team led by a mentor. We worked well enough and we each had an opportunity to contribute. I would like to see the gender dynamic change in future mentorships. Not to generalize, but I feel the environment would be more collaborative if I wasn’t the only woman in the group each time. We used one laptop, in this case a Mac. Let’s just say, commands are different and muscle memory is difficult to overcome, it was a slight challenge.
Blog post 2 was due tonight. I worked on it incrementally; so after the workshop ended, I cleaned it up a bit and submitted uneventfully. I look forward to completing the MVP for my project, take the pressure off, and venture into more interesting topics in the next couple of weeks. With the MVP almost done, I must keep myself motivated, lest I run out of STEAM. Red Dead Redemption II comes out soon, so I plan on rewarding myself with game time after each milestone
Deliverables: Blog Post 2
Week 8 10/22/29 Monday, October 22th Mentor/Mentee 1-on-1
Agenda: Discuss Week 7 Next Steps and Deliverables
Discussion: Writer's block. Ok, with that out of the way... Folks say the first sentence is the most difficult to write, right? Well at least I didn't start with “All Work and No Play Makes Jack a Dull Boy.” Ah the Shining, one of my favorites and Halloween is approaching fast.
I completed the last major functions for my project, NPD and VI. Finally, I had an almost MVP. I used one date, a start date. I walked Jim through my code and we discussed how to add more functionality. Having the user enter the start date was a bit tedious, how about deriving the start date by N number of years? Yes, the user would enter the number of years they wanted to consider New Products, say 4 years, and I'd code to compute '10/01/2014' for VI through Q3 2018. But what if we wanted to measure the VI in a past state, say for 2016 for N Years. For this I would need to sketch out the logic (and identify the edge cases). I'd have to think about the second part. For now, I was to concentrate on the N years issue.
Next Steps: Sketch logic for past state.
Deliverables: Code N years.
Week 9 10/29/18 Tuesday, October 30th Mentor/Mentee 1-on-1
Agenda: Discuss Week 8 Next Steps and Deliverables
Discussion: It finally happened; last Saturday I fell sick, fever sick. I have a strong immune system, cultivated over the years through deliberate exposure; I am an anti-germaphobe. It seems to have worked, I rarely get sick. I avoid it like the plague, mostly ‘cause I hate being sick, especially fever sick. I forgot how awful it feels, it's an overall consuming malaise making it difficult to do anything beyond basic bodily functions. I first felt it Friday after work, I was fatigued, exhausted. I went to bed early and woke up to body aches and a fever. I was on a mission, I had to finish coding. I settled into a catatonic state, medicated, sat, hydrated and waited it out...all...day...long. When I woke Sunday, my fever had broken.
I got to work and quickly found the package I needed, relativedelta. In one line of code, I had what I needed
start_date = (end_date - relativedelta(years=years))
We ended up meeting on Tuesday rather than Monday, which was fine by me. It gave me more time to recover, even though I had to work on Monday. We talked about the previous state logic, the end_date in particular, and how to filter the inactive/disc. I had added the date filtering to my VI function, once I had the active data frame and inactive/disc dataframe. Jim suggested I move it to my materials function. Mind you, my brain was still medicated and foggy, but I knew there was a reason I could not trim the materials dataframe before I filtered active and inactive/disc. I would have to sketch it out some more once my brain was at 100%. In the meantime, now that my MVP was complete, I asked Jim to look at refactoring. He suggested I move my calls into a main function and my parameters into a dictionary.
Next Steps: Re-sketch logic for past state.
Deliverables: Get refactor to work
Week 10 11/5/18 Thursday, Thursday 8th Mentor/Mentee 1-on-1 (remote)
Agenda: Discuss Week 9 Next Steps and Deliverables
Discussion: Today was the last day of early voting. I arrived at my polling place a little before 4:30 pm. I did not expect to see the long line, which wrapped around the inside perimeter of the McFetridge Sports Center. It was unlikely I'd be able to make our meeting, and I was right, it was two hours before I was done voting--but it was worth it! We settled on a remote session on Thursday.
I took all 3 file-path variables out of the functions along with the rest of my variables and moved them to the top of the program. I ended up returning my function calls after each function, since I couldn’t get the main function to work. I kept getting variable scope issues. After some troubleshooting with Jim, he pointed out my functions and the resulting dataframes were named the same, which Python allowed. I guess that's the danger of a dynamically typed language rather than statically typed language like Java. I sketched out the past state logic and confirmed my initial suspicion.
Next Steps: Think about what and how I'd like to visualize this data.
Deliverables: Code past state
Thursday, Nov 8 ChiPy (optional)
Week 11 11/12/18 Monday, November 12th Mentor/Mentee 1-on-1
Agenda: Discuss Week 10 Next Steps and Deliverables
Discussion: With my MVP complete we spent today's session discussing types of visualizations. The simplest thing to start with would be to plot the VI for the past N years in a line graph. Hmm, not too exciting. I wanted to plot NPD versus all sales over time, and figured I'd most likely use a stacked bar chart but wanted to research some other options. It's a shame the only geospatial data I had was 3-digit zipcodes. These would be difficult to plot, but I'd leave that to future exploration. For now, I had to focus on learning the ins and outs of Python's visualization packages. I would be out of town next week for Thanksgiving, so my time would be limited.
Next Steps: Learn Python visualization packages
Deliverables: Make some plots
Thursday, Nov 15 Data science Coding Workshop - Blog 3 Due
Week 12 11/19/18
Week 13 11/26/18
Week 14 12/3/18
Week 15 12/10/18
Thursday, Dec 13 Final Presentations