Logistic Regression

Due: 11. September (23:55)

Overview

In this homework you'll implement stochastic gradient ascent for logistic regression and apply it to the task of determining whether documents are talking about hockey or baseball.

Hockey and Baseball: Are they really that different?

This will be slightly more difficult than the last homework (the difficulty will slowly ramp upward). You should not use any libraries that implement any of the functionality of logistic regression for this assignment. Logistic regression is implemented in scikit-learn, but for now you should do everything by hand. You'll be able to use library implementations of logistic regression in the future (as soon as the next homework).

You'll turn in your code on ELMS. This assignment is worth 30 points.

What you have to do

Coding (25 points):

  1. Understand how the code creates feature vectors (this will help you code the solution and do the later analysis). You don't actually need to write any code for this, however.
  2. (Optional) Store necessary data in the constructor so you can do classification later.
  3. You'll likely need to write some code to get the best/worst features (see below).
  4. Modify the sg update function to perform non-regularized updates.
  5. Modify the sg update function so that it finds regularized updates.
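
As a rough illustration of items 4 and 5, here is a minimal sketch of a single stochastic gradient ascent step. It is not the provided starter code: the function name `sg_update_sketch`, its parameters, the bias-at-index-0 layout, and the choice of an L2 penalty of the form mu * sum(beta_j^2) are all assumptions made for this example.

```python
import numpy as np


def sigmoid(score, threshold=20.0):
    """Logistic function, with the score clipped to avoid overflow in exp."""
    score = np.clip(score, -threshold, threshold)
    return 1.0 / (1.0 + np.exp(-score))


def sg_update_sketch(beta, x, y, step=0.1, mu=0.0):
    """One stochastic gradient ascent step on a single example.

    beta: weight vector (assumed layout: beta[0] is the bias weight)
    x:    feature vector for the example, with x[0] == 1 for the bias
    y:    label, 0 or 1
    step: learning rate
    mu:   L2 regularization strength (0 gives the unregularized update)
    """
    pi = sigmoid(np.dot(beta, x))
    # Gradient of the log-likelihood for this one example is (y - pi) * x.
    new_beta = beta + step * (y - pi) * x
    # Assumption: penalize every weight except the bias, using the
    # gradient of mu * sum(beta_j ** 2), which is 2 * mu * beta_j.
    new_beta[1:] -= step * 2.0 * mu * beta[1:]
    return new_beta
```

With all-zero weights the predicted probability is 0.5, so a positive example with a step of 1.0 moves each weight by 0.5 times its feature value. Setting `mu` to 0 recovers the unregularized update.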

Analysis (5 points):

  1. What is the role of the learning rate?
  2. How many passes over the data do you need to complete?
  3. What words are the best predictors of each class? How (mathematically) did you find them?
  4. What words are the poorest predictors of classes? How (mathematically) did you find them?
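
One common way to approach questions 3 and 4 is to inspect the learned weights: large positive weights push toward one class, large negative weights toward the other, and weights near zero contribute little either way. The sketch below assumes a weight vector `beta` aligned with a word list `vocab` (bias excluded). Both names are hypothetical, and which sign corresponds to hockey or baseball depends on how the labels are encoded.

```python
import numpy as np


def rank_features(beta, vocab, k=3):
    """Return (most positive, most negative, closest to zero) words.

    beta:  learned weights, one per word (assumed to exclude the bias)
    vocab: list of words aligned with beta
    k:     how many words to report in each group
    """
    beta = np.asarray(beta, dtype=float)
    order = np.argsort(beta)                  # ascending by weight
    by_magnitude = np.argsort(np.abs(beta))   # ascending by |weight|
    most_positive = [vocab[i] for i in order[::-1][:k]]
    most_negative = [vocab[i] for i in order[:k]]
    least_informative = [vocab[i] for i in by_magnitude[:k]]
    return most_positive, most_negative, least_informative


# Toy weights, invented purely for illustration.
weights = [2.0, -1.5, 0.01, 0.7, -0.02]
words = ["puck", "pitcher", "the", "goalie", "and"]
best_pos, best_neg, poorest = rank_features(weights, words, k=2)
```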

Extra credit:

  1. Use a schedule to update the learning rate.
    • Supply an appropriate argument to the step parameter
    • Support it in your sg update
    • Show the effect in your analysis document
  2. Use document frequency (provided in the vocabulary file) to modify the feature values to tf-idf.
    • Modify the Example to store the df vector
    • With the appropriate flag, use the df vector rather than x in the update
    • Show the effect in your analysis document
  3. Implement lazy updating
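
For the first two extra-credit items, the following sketch shows one possible learning-rate schedule and one possible tf-idf weighting. The inverse-decay form, the `decay` constant, the function names, and the log(N / df) definition of idf are all assumptions; other schedules and idf variants are equally valid.

```python
import numpy as np


def inverse_decay_step(initial_step, iteration, decay=0.01):
    """Learning rate that shrinks as more updates are performed."""
    return initial_step / (1.0 + decay * iteration)


def tfidf(tf, df, num_docs):
    """Scale term frequencies by inverse document frequency.

    tf:       term counts for one document
    df:       number of documents containing each term
    num_docs: total number of documents
    """
    tf = np.asarray(tf, dtype=float)
    df = np.asarray(df, dtype=float)
    # Guard against df == 0 so the division is always defined.
    idf = np.log(num_docs / np.maximum(df, 1.0))
    return tf * idf
```

A word that appears in every document gets an idf of zero under this definition, so it no longer influences the update.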

Caution: When implementing extra credit, make sure your implementation of the regular algorithms doesn't change.

What to turn in

  1. Submit your logreg.py file (include your name at the top of the source)
  2. Submit your analysis.pdf file
    • no more than one page
    • pictures are better than text
    • include your name at the top of the PDF

Unit Tests

I've provided unit tests based on the example that we worked through in class. Before running your code on real data, make sure it passes all of the unit tests.

cs244-33-dhcp:logreg jbg$ python tests.py
.[ 0.  0.  0.  0.  0.]
[ 1.  4.  3.  1.  0.]
F
======================================================================
FAIL: test_unreg (__main__.TestKnn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 22, in test_unreg
    self.assertAlmostEqual(b1[0], .5)
AssertionError: 0.0 != 0.5 within 7 places

----------------------------------------------------------------------
Ran 2 tests in 0.001s

FAILED (failures=1)

Example

This is an example of what your runs should look like:

cs244-33-dhcp:logreg jbg$ python logreg.py
Read in 1064 train and 133 test
Update 1	TP -795.043334	HP -93.911814	TA 0.497180	HA 0.533835
Update 6	TP -724.253354	HP -93.249903	TA 0.572368	HA 0.518797
Update 11	TP -780.935341	HP -98.223121	TA 0.549812	HA 0.466165
Update 16	TP -732.288174	HP -89.181998	TA 0.565789	HA 0.601504
Update 21	TP -719.325439	HP -85.907924	TA 0.583647	HA 0.586466
Update 26	TP -719.021637	HP -83.958887	TA 0.594925	HA 0.586466
Update 31	TP -923.834700	HP -108.561626	TA 0.572368	HA 0.518797
Update 36	TP -718.097629	HP -84.612442	TA 0.631579	HA 0.654135
Update 41	TP -725.644667	HP -84.026452	TA 0.661654	HA 0.631579
Update 46	TP -664.738996	HP -79.949308	TA 0.686090	HA 0.646617
Update 51	TP -619.667363	HP -76.821255	TA 0.706767	HA 0.684211
Update 56	TP -619.743899	HP -75.954954	TA 0.709586	HA 0.646617
Update 61	TP -589.725505	HP -74.847902	TA 0.715226	HA 0.661654
Update 66	TP -584.875724	HP -74.032131	TA 0.728383	HA 0.661654
Update 71	TP -645.906845	HP -78.866905	TA 0.708647	HA 0.684211
Update 76	TP -526.786160	HP -70.306575	TA 0.749060	HA 0.684211
Update 81	TP -771.468262	HP -91.125394	TA 0.684211	HA 0.639098
Update 86	TP -537.432429	HP -67.215471	TA 0.755639	HA 0.714286
Update 91	TP -566.090520	HP -70.279963	TA 0.754699	HA 0.714286
Update 96	TP -484.811485	HP -66.442201	TA 0.758459	HA 0.736842
Update 101	TP -513.658442	HP -68.943852	TA 0.755639	HA 0.729323
Update 106	TP -501.243674	HP -68.363089	TA 0.775376	HA 0.744361
Update 111	TP -441.389660	HP -62.575458	TA 0.796992	HA 0.766917
Update 116	TP -439.558734	HP -63.039959	TA 0.803571	HA 0.781955
Update 121	TP -431.629278	HP -61.903585	TA 0.807331	HA 0.781955
Update 126	TP -414.894054	HP -59.988681	TA 0.816729	HA 0.781955
Update 131	TP -414.096355	HP -60.299108	TA 0.818609	HA 0.774436
Update 136	TP -414.953357	HP -63.413016	TA 0.820489	HA 0.789474
Update 141	TP -430.229841	HP -68.031491	TA 0.810150	HA 0.781955
Update 146	TP -413.538741	HP -65.211087	TA 0.829887	HA 0.781955
Update 151	TP -396.605257	HP -63.010010	TA 0.828947	HA 0.789474
Update 156	TP -387.900439	HP -62.535487	TA 0.834586	HA 0.789474
Update 161	TP -630.948522	HP -87.422856	TA 0.745301	HA 0.744361
Update 166	TP -460.561482	HP -68.322559	TA 0.803571	HA 0.766917
Update 171	TP -419.917408	HP -64.196351	TA 0.819549	HA 0.796992
Update 176	TP -364.486979	HP -57.533723	TA 0.845865	HA 0.804511
Update 181	TP -360.069486	HP -56.575153	TA 0.844925	HA 0.804511
Update 186	TP -358.216060	HP -58.647671	TA 0.862782	HA 0.804511
Update 191	TP -358.443440	HP -59.617043	TA 0.854323	HA 0.796992
Update 196	TP -347.712291	HP -56.790399	TA 0.864662	HA 0.819549
Update 201	TP -365.304127	HP -60.799011	TA 0.851504	HA 0.819549
Update 206	TP -446.383830	HP -73.102032	TA 0.812030	HA 0.781955
Update 211	TP -446.666095	HP -72.650355	TA 0.813910	HA 0.774436
Update 216	TP -363.252668	HP -60.006379	TA 0.858083	HA 0.804511
Update 221	TP -328.394830	HP -53.157330	TA 0.872180	HA 0.827068
Update 226	TP -519.809419	HP -58.968947	TA 0.818609	HA 0.789474
Update 231	TP -503.829932	HP -57.102820	TA 0.844925	HA 0.812030
Update 236	TP -501.128467	HP -57.658383	TA 0.829887	HA 0.819549
Update 241	TP -525.947720	HP -58.729245	TA 0.835526	HA 0.796992
Update 246	TP -623.632145	HP -69.934241	TA 0.829887	HA 0.766917
Update 251	TP -519.955174	HP -59.036940	TA 0.838346	HA 0.789474
Update 256	TP -532.697618	HP -60.154300	TA 0.836466	HA 0.796992
Update 261	TP -516.381767	HP -58.017598	TA 0.842105	HA 0.789474
Update 266	TP -503.833188	HP -56.245212	TA 0.844925	HA 0.819549
Update 271	TP -497.169120	HP -55.620452	TA 0.840226	HA 0.819549
Update 276	TP -433.555533	HP -53.154421	TA 0.836466	HA 0.819549
Update 281	TP -365.268184	HP -47.593577	TA 0.848684	HA 0.819549
Update 286	TP -355.258939	HP -47.344792	TA 0.855263	HA 0.812030
Update 291	TP -436.582135	HP -56.915134	TA 0.801692	HA 0.766917
Update 296	TP -392.833321	HP -52.702894	TA 0.832707	HA 0.789474
Update 301	TP -311.048331	HP -42.815755	TA 0.871241	HA 0.827068
Update 306	TP -326.612375	HP -45.353690	TA 0.863722	HA 0.842105
Update 311	TP -325.718712	HP -45.190506	TA 0.864662	HA 0.842105
Update 316	TP -330.993047	HP -45.866450	TA 0.860902	HA 0.842105
Update 321	TP -334.765816	HP -46.509667	TA 0.859962	HA 0.834586
Update 326	TP -307.208299	HP -41.905350	TA 0.873120	HA 0.864662
Update 331	TP -305.025206	HP -40.842411	TA 0.874060	HA 0.857143
Update 336	TP -316.113642	HP -42.098775	TA 0.861842	HA 0.857143
Update 341	TP -308.905784	HP -42.592735	TA 0.880639	HA 0.842105
Update 346	TP -301.315709	HP -44.000177	TA 0.876880	HA 0.827068
Update 351	TP -295.213006	HP -42.678737	TA 0.873120	HA 0.827068
Update 356	TP -293.054803	HP -42.181584	TA 0.875940	HA 0.827068
Update 361	TP -307.878874	HP -48.319265	TA 0.866541	HA 0.857143
Update 366	TP -281.415654	HP -42.710557	TA 0.876880	HA 0.842105
Update 371	TP -284.224805	HP -42.790451	TA 0.872180	HA 0.834586
Update 376	TP -297.730306	HP -45.348864	TA 0.866541	HA 0.849624
Update 381	TP -299.381950	HP -46.073845	TA 0.867481	HA 0.849624
Update 386	TP -266.978957	HP -38.198115	TA 0.890038	HA 0.849624
Update 391	TP -267.592430	HP -37.676200	TA 0.885338	HA 0.864662
Update 396	TP -266.486973	HP -37.265295	TA 0.892857	HA 0.864662
Update 401	TP -264.800773	HP -37.203395	TA 0.894737	HA 0.857143
Update 406	TP -264.895614	HP -37.091807	TA 0.894737	HA 0.864662
Update 411	TP -261.472232	HP -37.263451	TA 0.891917	HA 0.849624
Update 416	TP -270.030120	HP -39.855177	TA 0.885338	HA 0.872180
Update 421	TP -268.127615	HP -39.374658	TA 0.886278	HA 0.879699
Update 426	TP -270.696393	HP -40.065510	TA 0.883459	HA 0.879699
Update 431	TP -271.979033	HP -39.674723	TA 0.886278	HA 0.879699
Update 436	TP -273.757982	HP -40.091251	TA 0.885338	HA 0.879699
Update 441	TP -256.428089	HP -37.875637	TA 0.886278	HA 0.872180
Update 446	TP -265.486941	HP -40.280485	TA 0.886278	HA 0.872180
Update 451	TP -268.866867	HP -40.820852	TA 0.885338	HA 0.864662
Update 456	TP -254.401730	HP -37.252358	TA 0.886278	HA 0.879699
Update 461	TP -249.199570	HP -35.657289	TA 0.899436	HA 0.872180
Update 466	TP -250.042749	HP -35.811959	TA 0.895677	HA 0.879699
Update 471	TP -248.163421	HP -36.179052	TA 0.903195	HA 0.864662
Update 476	TP -252.550412	HP -37.644849	TA 0.898496	HA 0.864662
Update 481	TP -247.348780	HP -36.379875	TA 0.906015	HA 0.879699
Update 486	TP -248.399603	HP -37.045843	TA 0.899436	HA 0.864662
Update 491	TP -245.914828	HP -36.079985	TA 0.904135	HA 0.872180
Update 496	TP -235.327134	HP -35.575022	TA 0.903195	HA 0.887218
Update 501	TP -232.550895	HP -34.814766	TA 0.906015	HA 0.887218
Update 506	TP -238.172257	HP -35.715463	TA 0.900376	HA 0.894737
Update 511	TP -230.731758	HP -34.452442	TA 0.909774	HA 0.894737
Update 516	TP -219.919829	HP -35.034406	TA 0.918233	HA 0.887218
Update 521	TP -219.567390	HP -34.074965	TA 0.915414	HA 0.879699
Update 526	TP -220.114015	HP -34.350638	TA 0.921053	HA 0.909774
Update 531	TP -221.043738	HP -34.366944	TA 0.916353	HA 0.909774
Update 536	TP -231.274508	HP -36.335583	TA 0.916353	HA 0.902256
Update 541	TP -234.383762	HP -36.957993	TA 0.918233	HA 0.902256
Update 546	TP -235.365692	HP -37.182322	TA 0.918233	HA 0.894737
Update 551	TP -210.956885	HP -36.030627	TA 0.920113	HA 0.872180
Update 556	TP -206.827802	HP -34.031300	TA 0.927632	HA 0.894737
Update 561	TP -206.599103	HP -33.590585	TA 0.928571	HA 0.887218
Update 566	TP -227.765729	HP -39.128048	TA 0.925752	HA 0.864662
Update 571	TP -223.652923	HP -38.349921	TA 0.922932	HA 0.872180
Update 576	TP -214.171239	HP -36.168571	TA 0.926692	HA 0.887218
Update 581	TP -212.257189	HP -35.012516	TA 0.923872	HA 0.887218
Update 586	TP -203.270242	HP -32.419454	TA 0.928571	HA 0.902256
Update 591	TP -209.242059	HP -34.471181	TA 0.931391	HA 0.894737
Update 596	TP -204.750159	HP -32.605763	TA 0.933271	HA 0.917293
Update 601	TP -209.017823	HP -34.255211	TA 0.927632	HA 0.902256
Update 606	TP -209.143816	HP -34.425883	TA 0.926692	HA 0.902256
Update 611	TP -218.744265	HP -37.471917	TA 0.921053	HA 0.849624
Update 616	TP -207.618752	HP -34.613326	TA 0.928571	HA 0.894737
Update 621	TP -209.962968	HP -36.089992	TA 0.927632	HA 0.872180
Update 626	TP -215.329039	HP -37.521708	TA 0.926692	HA 0.864662
Update 631	TP -190.088163	HP -31.944084	TA 0.940789	HA 0.909774
Update 636	TP -190.315367	HP -32.219854	TA 0.938910	HA 0.917293
Update 641	TP -199.939166	HP -34.886538	TA 0.930451	HA 0.872180
Update 646	TP -202.746621	HP -35.861463	TA 0.925752	HA 0.864662
Update 651	TP -185.532332	HP -30.463159	TA 0.939850	HA 0.924812
Update 656	TP -184.424930	HP -30.069666	TA 0.941729	HA 0.924812
Update 661	TP -179.530082	HP -29.647254	TA 0.943609	HA 0.909774
Update 666	TP -191.254865	HP -33.979399	TA 0.935150	HA 0.909774
Update 671	TP -186.431419	HP -32.549066	TA 0.941729	HA 0.909774
Update 676	TP -187.922464	HP -33.023230	TA 0.937030	HA 0.909774
Update 681	TP -187.149745	HP -34.514251	TA 0.940789	HA 0.909774
Update 686	TP -188.801544	HP -35.110833	TA 0.938910	HA 0.902256
Update 691	TP -175.969196	HP -31.641947	TA 0.948308	HA 0.917293
Update 696	TP -181.056217	HP -34.162334	TA 0.947368	HA 0.894737
Update 701	TP -180.298186	HP -34.073575	TA 0.947368	HA 0.894737
Update 706	TP -177.680984	HP -33.058071	TA 0.948308	HA 0.909774
Update 711	TP -176.270240	HP -32.659614	TA 0.951128	HA 0.902256
Update 716	TP -176.192717	HP -33.066650	TA 0.947368	HA 0.902256
Update 721	TP -174.920907	HP -32.252009	TA 0.950188	HA 0.909774
Update 726	TP -179.566204	HP -33.883239	TA 0.943609	HA 0.894737
Update 731	TP -202.554506	HP -39.157441	TA 0.928571	HA 0.864662
Update 736	TP -174.134479	HP -31.248150	TA 0.949248	HA 0.902256
Update 741	TP -177.873641	HP -32.372352	TA 0.947368	HA 0.902256
Update 746	TP -174.715064	HP -31.604677	TA 0.950188	HA 0.902256
Update 751	TP -174.883966	HP -32.101748	TA 0.948308	HA 0.902256
Update 756	TP -169.236322	HP -30.478823	TA 0.950188	HA 0.902256
Update 761	TP -171.791875	HP -31.097613	TA 0.947368	HA 0.902256
Update 766	TP -166.135727	HP -29.914869	TA 0.951128	HA 0.917293
Update 771	TP -162.789146	HP -28.742155	TA 0.953947	HA 0.917293
Update 776	TP -155.620980	HP -25.268135	TA 0.957707	HA 0.939850
Update 781	TP -153.661777	HP -24.935922	TA 0.956767	HA 0.932331
Update 786	TP -153.515603	HP -24.967693	TA 0.956767	HA 0.932331
Update 791	TP -152.179871	HP -26.058377	TA 0.957707	HA 0.932331
Update 796	TP -152.797816	HP -26.782273	TA 0.958647	HA 0.924812
Update 801	TP -149.498759	HP -25.614607	TA 0.956767	HA 0.932331
Update 806	TP -148.787505	HP -25.406701	TA 0.957707	HA 0.932331
Update 811	TP -155.537793	HP -27.178570	TA 0.955827	HA 0.924812
Update 816	TP -156.595863	HP -27.494496	TA 0.954887	HA 0.924812
Update 821	TP -159.067447	HP -28.274931	TA 0.954887	HA 0.909774
Update 826	TP -163.762410	HP -29.271812	TA 0.953008	HA 0.902256
Update 831	TP -165.746199	HP -29.979424	TA 0.950188	HA 0.902256
Update 836	TP -155.041437	HP -26.561640	TA 0.955827	HA 0.932331
Update 841	TP -157.019287	HP -27.324175	TA 0.954887	HA 0.924812
Update 846	TP -156.920476	HP -27.295586	TA 0.954887	HA 0.924812
Update 851	TP -154.819910	HP -26.761268	TA 0.954887	HA 0.932331
Update 856	TP -153.613883	HP -26.350923	TA 0.955827	HA 0.932331
Update 861	TP -149.341293	HP -26.865991	TA 0.959586	HA 0.939850
Update 866	TP -144.534047	HP -25.942861	TA 0.959586	HA 0.962406
Update 871	TP -141.828335	HP -25.065911	TA 0.960526	HA 0.939850
Update 876	TP -141.676037	HP -24.658443	TA 0.959586	HA 0.939850
Update 881	TP -137.815877	HP -22.379398	TA 0.959586	HA 0.954887
Update 886	TP -137.131294	HP -22.374861	TA 0.955827	HA 0.947368
Update 891	TP -137.988095	HP -22.629840	TA 0.955827	HA 0.954887
Update 896	TP -135.409071	HP -22.605572	TA 0.958647	HA 0.962406
Update 901	TP -133.938345	HP -22.571577	TA 0.960526	HA 0.962406
Update 906	TP -133.996296	HP -22.939846	TA 0.962406	HA 0.954887
Update 911	TP -132.330257	HP -22.321340	TA 0.962406	HA 0.954887
Update 916	TP -132.328365	HP -22.480326	TA 0.963346	HA 0.962406
Update 921	TP -134.294122	HP -23.133354	TA 0.962406	HA 0.954887
Update 926	TP -134.286241	HP -23.084312	TA 0.962406	HA 0.954887
Update 931	TP -134.333999	HP -23.048530	TA 0.965226	HA 0.954887
Update 936	TP -135.791101	HP -23.759125	TA 0.964286	HA 0.954887
Update 941	TP -127.662426	HP -21.643895	TA 0.962406	HA 0.954887
Update 946	TP -127.404373	HP -21.752508	TA 0.961466	HA 0.954887
Update 951	TP -128.268071	HP -22.488936	TA 0.961466	HA 0.962406
Update 956	TP -132.638895	HP -23.404714	TA 0.964286	HA 0.962406
Update 961	TP -208.917642	HP -44.344104	TA 0.938910	HA 0.872180
Update 966	TP -151.577926	HP -31.174781	TA 0.959586	HA 0.924812
Update 971	TP -151.395835	HP -31.138677	TA 0.958647	HA 0.924812
Update 976	TP -143.654987	HP -29.721879	TA 0.962406	HA 0.932331
Update 981	TP -135.325731	HP -27.752690	TA 0.961466	HA 0.932331
Update 986	TP -123.063411	HP -25.057247	TA 0.967105	HA 0.932331
Update 991	TP -123.571766	HP -25.257429	TA 0.965226	HA 0.932331
Update 996	TP -167.356326	HP -35.649418	TA 0.947368	HA 0.894737
Update 1001	TP -172.011640	HP -36.470991	TA 0.946429	HA 0.887218
Update 1006	TP -147.274005	HP -31.710242	TA 0.956767	HA 0.887218
Update 1011	TP -138.394660	HP -29.736350	TA 0.959586	HA 0.909774
Update 1016	TP -122.331007	HP -25.646725	TA 0.967105	HA 0.924812
Update 1021	TP -121.507210	HP -25.588628	TA 0.967105	HA 0.924812
Update 1026	TP -120.812450	HP -25.503914	TA 0.967105	HA 0.924812
Update 1031	TP -116.732055	HP -24.594705	TA 0.970865	HA 0.932331
Update 1036	TP -116.699704	HP -24.685418	TA 0.968985	HA 0.939850
Update 1041	TP -114.495031	HP -24.064381	TA 0.964286	HA 0.932331
Update 1046	TP -114.502237	HP -24.088618	TA 0.964286	HA 0.932331
Update 1051	TP -114.773030	HP -24.270661	TA 0.964286	HA 0.932331
Update 1056	TP -114.832721	HP -24.318733	TA 0.964286	HA 0.932331
Update 1061	TP -112.590763	HP -23.805704	TA 0.965226	HA 0.932331
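
The columns in this output are not defined in the text above. Assuming TP and HP are the log probability of the training and held-out data and TA and HA are the corresponding accuracies (an interpretation, not something stated here), a sketch of how such numbers could be computed follows. The function name `evaluate` and the matrix layout are hypothetical.

```python
import numpy as np


def evaluate(beta, X, y):
    """Return (log_probability, accuracy) of labels y under weights beta.

    X: one row per document, one column per feature
    y: labels, 0 or 1
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = X @ beta
    # log sigmoid(s) = -log(1 + exp(-s)), computed stably with logaddexp.
    log_p1 = -np.logaddexp(0.0, -scores)
    log_p0 = -np.logaddexp(0.0, scores)
    log_prob = float(np.sum(y * log_p1 + (1.0 - y) * log_p0))
    # Predict class 1 when the score is positive (probability above 0.5).
    accuracy = float(np.mean((scores > 0) == (y == 1)))
    return log_prob, accuracy
```

Under this reading, log probability rising toward zero and accuracy rising toward one both indicate a better fit, which matches the overall trend in the run above.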

Hints

  1. As with the previous assignment, make sure that you debug on small datasets first (I've provided toy text in the data directory to get you started).
  2. Make sure you get the unregularized version working well before adding regularization.
  3. Use numpy functions whenever you can to make the computation faster.
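
To illustrate the third hint, here is a small comparison of a hand-written loop and the equivalent numpy call for the score (the dot product of weights and features). The function names are made up for this example.

```python
import numpy as np


def score_loop(beta, x):
    """Dot product computed one element at a time in Python."""
    total = 0.0
    for j in range(len(beta)):
        total += beta[j] * x[j]
    return total


def score_numpy(beta, x):
    """Same dot product, delegated to numpy's compiled implementation."""
    return float(np.dot(beta, x))
```

Both functions return the same value; the numpy version avoids the per-element Python overhead, which matters when the vocabulary is large.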